O dear Ophelia!
I am ill at these numbers: 
I have not art to reckon my groans. 

—Hamlet (Act II, Sc. 2, Line 120) 


The algorithms discussed in this book deal directly with numbers; yet I 
believe they are properly called seminumerical, because they lie on the borderline 
between numeric and symbolic calculation. Each algorithm not only computes 
the desired answers to a problem, it also is intended to blend well with the 
internal operations of a digital computer. In many cases a person will not be 
able to appreciate the beauty of such an algorithm unless he or she also has some 
knowledge of a computer’s machine language; the efficiency of the corresponding 
machine program is a vital factor that cannot be divorced from the algorithm 
itself. The problem is to find the best ways to make computers deal with numbers, 
and this involves tactical as well as numerical considerations. Therefore the 
subject matter of this book is unmistakably a part of computer science, as well 
as of numerical mathematics. 

Some people working in “higher levels” of numerical analysis will regard the 
topics treated here as the domain of system programmers. Other people working 
in “higher levels” of system programming will regard the topics treated here as 
the domain of numerical analysts. But I hope that there are a few people left who 
will want to look carefully at these basic methods; although the methods reside 
perhaps on a low level, they underlie all of the more grandiose applications of 
computers to numerical problems, so it is important to know them well. We are 
concerned here with the interface between numerical mathematics and computer 
programming, and it is the mating of both types of skills that makes the subject 
so interesting. 

There is a noticeably higher percentage of mathematical material in this 
book than in other volumes of this series, because of the nature of the subjects 
treated. In most cases the necessary mathematical topics are developed here 
starting almost from scratch (or from results proved in Volume 1), but in some 
easily recognizable sections a knowledge of calculus has been assumed. 
PREFACE


This volume comprises Chapters 3 and 4 of the complete series. Chapter 3 is 
concerned with “random numbers”: it is not only a study of various methods for 
generating random sequences, it also investigates statistical tests for randomness, 
as well as the transformation of uniform random numbers into other types of 
random quantities; the latter subject illustrates how random numbers are used 
in practice. I have also included a section about the nature of randomness itself. 
Chapter 4 is my attempt to tell the fascinating story of what mankind has been 
able to learn about the processes of arithmetic, after centuries of progress. It 
discusses various systems for representing numbers, and how to convert between 
them; and it treats arithmetic on floating point numbers, high-precision integers, 
rational fractions, polynomials, and power series, including the questions of 
factoring and finding greatest common divisors. 

Each of Chapters 3 and 4 can be used as the basis of a one-semester college 
course at the junior to graduate level. Although courses on “Random Numbers” 
and on “Arithmetic” are not presently a part of many college curricula, I believe 
the reader will find that the subject matter of these chapters lends itself nicely 
to a unified treatment of material that has real educational value. My own
experience has been that these courses are a good means of introducing elementary
probability theory and number theory to college students; nearly all of the topics
usually treated in such introductory courses arise naturally in connection with
applications, and the presence of these applications can be an important
motivation that helps the student to learn and to appreciate the theory. Furthermore,
each chapter gives a few hints of more advanced topics that will whet the appetite 
of many students for further mathematical study. 

For the most part this book is self-contained, except for occasional
discussions relating to the MIX computer explained in Volume 1. Appendix B contains a
summary of the mathematical notations used, some of which are a little different 
from those found in traditional mathematics books. 

In addition to the acknowledgments made in the preface to Volume 1, 
I would like to express deep appreciation to Elwyn R. Berlekamp, John Brillhart, 
George E. Collins, Stephen A. Cook, D. H. Lehmer, M. Donald MacLaren, 
Mervin E. Muller, Kenneth B. Stolarsky, and H. Zassenhaus, who have
generously devoted considerable time to reading portions of the preliminary
manuscript, and who have suggested many valuable improvements.

Princeton, New Jersey D. E. K.

October 1968 


Preface to the Second Edition 

My first plan, when beginning to prepare this new edition, was to make it like 
the second edition of Volume 1: I went through the entire book and tried to 
improve every page without greatly perturbing the page numbering. But the 
number of improvements turned out to be so great that the entire book needed 
to be typeset again. As a result, I decided to make this book the first test case 
for a new computer typesetting system I have been developing. I hope that most
readers will like the slight changes in format, since my aim has been to produce 
a book whose typography is of the highest possible quality—superior even to the 
fine appearance of the previous editions, in spite of the fact that a computer 
is now involved. If all goes well, the third edition of Volume 1 and the second 
edition of Volume 3, and all editions of Volumes 4 through 7, will be published 
in the present style. 

The decision to reset this entire book has freed me from the shackles of 
the previous page numbering, so I have been able to make major refinements 
and to insert a lot of new material. I estimate that about 45 percent of the 
book has changed. I did try, however, to keep the exercise numbers from being 
substantially altered; although many of the old exercises have been replaced by 
new and better ones, the new exercises tend to relate to the same idea as before. 
The explosive growth of seminumerical research in recent years has of course 
made it impossible for me to insert all of the beautiful ideas in this field that 
have been discovered since 1968; but I think that this edition does contain an 
up-to-date survey of all the major paradigms and basic theory of the subject, 
and it seems reasonable to believe that very few of the topics discussed here will 
ever become obsolete. 

The National Science Foundation and the Office of Naval Research have been 
particularly generous in their support of my research as I work on these books. I 
am also deeply grateful for the advice and unselfish assistance of many readers, 
too numerous to mention. In this regard I want to acknowledge especially the 
help of several people whose contributions have been really major: B. I. Aspvall, 
R. P. Brent, U. Dieter, M. J. Fischer, R. W. Gosper, D. C. Hoaglin, W. M. Kahan, 
F. M. Liang, J. F. Reiser, A. G. Waterman, S. Winograd, and M. C. Wunderlich. 
Furthermore Marion Howe and other people in the Addison-Wesley production 
department have been enormously helpful in untangling literally thousands of 
hand-written inserts so that a very chaotic manuscript has come out looking 
reasonably well-organized. I suppose some mistakes still remain, or have crept
in, and I would like to fix them; therefore I will cheerfully pay $2.00 reward to
the first finder of each technical, typographical, or historical error.

Stanford, California D. E. K.

July 1980 


‘Defendit numerus,’ [there is safety in numbers]
is the maxim of the foolish;
‘Deperdit numerus,’ [there is ruin in numbers]
of the wise.


—C. C. COLTON (1820) 






NOTES ON THE EXERCISES 


The exercises in this set of books have been designed for self-study as well 
as classroom study. It is difficult, if not impossible, for anyone to learn a subject 
purely by reading about it, without applying the information to specific problems 
and thereby being encouraged to think about what has been read. Furthermore, 
we all learn best the things that we have discovered for ourselves. Therefore the 
exercises form a major part of this work; a definite attempt has been made to 
keep them as informative as possible and to select problems that are enjoyable 
to solve. 

In many books, easy exercises are found mixed randomly among extremely 
difficult ones. This is sometimes unfortunate because readers like to know in 
advance how long a problem ought to take—otherwise they may just skip over 
all the problems. A classic example of such a situation is the book Dynamic 
Programming by Richard Bellman; this is an important, pioneering work in 
which a group of problems is collected together at the end of some chapters 
under the heading “Exercises and Research Problems,” with extremely trivial 
questions appearing in the midst of deep, unsolved problems. It is rumored that 
someone once asked Dr. Bellman how to tell the exercises apart from the research 
problems, and he replied, “If you can solve it, it is an exercise; otherwise it’s a 
research problem.” 

Good arguments can be made for including both research problems and 
very easy exercises in a book of this kind; therefore, to save the reader from 
the possible dilemma of determining which are which, rating numbers have been 
provided to indicate the level of difficulty. These numbers have the following 
general significance: 

Rating Interpretation 

00 An extremely easy exercise that can be answered immediately if the material 
of the text has been understood; such an exercise can almost always be worked 
“in your head.” 

10 A simple problem that makes you think over the material just read, but it is 
by no means difficult. It should be possible to do this in one minute at most; 
pencil and paper may be useful in obtaining the solution. 

20 An average problem that tests basic understanding of the text material, but you 
may need about fifteen or twenty minutes to answer it completely. 


30 A problem of moderate difficulty and/or complexity; this one may involve over
two hours’ work to solve satisfactorily. 

40 Quite a difficult or lengthy problem that would be suitable for a term project 
in classroom situations. It is expected that a student will be able to solve the 
problem in a reasonable amount of time, but the solution is not trivial. 

50 A research problem that has not yet been solved satisfactorily, as far as the 
author knew at the time of writing, although many people have tried. If you 
have found an answer to such a problem, you ought to write it up for publication; 
furthermore, the author of this book would appreciate hearing about the solution 
as soon as possible (provided that it is correct). 

By interpolation in this “logarithmic” scale, the significance of other rating 
numbers becomes clear. For example, a rating of 17 would indicate an exercise 
that is a bit simpler than average. Problems with a rating of 50 that are 
subsequently solved by some reader may appear with a 45 rating in later editions 
of the book. 

The author has earnestly tried to assign accurate rating numbers, but it is 
difficult for the person who makes up a problem to know just how formidable it 
will be for someone else to find a solution; and everyone has more aptitude for 
certain types of problems than for others. It is hoped that the rating numbers 
represent a good guess as to the level of difficulty, but they should be taken as 
general guidelines, not as absolute indicators. 

This book has been written for readers with varying degrees of mathematical 
training and sophistication; as a result, some of the exercises are intended only for 
the use of more mathematically inclined readers. The rating is preceded by an M 
if the exercise involves mathematical concepts or motivation to a greater extent 
than necessary for someone who is primarily interested only in programming 
the algorithms themselves. An exercise is marked with the letters “HM” if its
solution necessarily involves a knowledge of calculus or other higher mathematics
not developed in this book. An “HM” designation does not necessarily imply
difficulty.

Some exercises are preceded by an arrowhead, “►”; this designates problems
that are especially instructive and that are especially recommended. Of course, 
no reader/student is expected to work all of the exercises, so those that seem to 
be the most valuable have been singled out. (This is not meant to detract from 
the other exercises!) Each reader should at least make an attempt to solve all 
of the problems whose rating is 10 or less; and the arrows may help to indicate 
which of the problems with a higher rating should be given priority. 

Solutions to most of the exercises appear in the answer section. Please use 
them wisely; do not turn to the answer until you have made a genuine effort 
to solve the problem by yourself, or unless you do not have time to work this 
particular problem. After getting your own solution or giving the problem a 
decent try, you may find the answer instructive and helpful. The solution given 
will often be quite short, and it will sketch the details under the assumption 
that you have earnestly tried to solve it by your own means first. Sometimes the 
solution gives less information than was asked; often it gives more. It is quite 
possible that you may have a better answer than the one published here, or you
may have found an error in the published solution; in such a case, the author 
will be pleased to know the details. Later editions of this book will give the 
improved solutions together with the solver’s name where appropriate. 

When working an exercise you may generally use the answers to previous 
exercises, unless specifically forbidden from doing so. The rating numbers have 
been assigned with this in mind; thus it is possible for exercise n + 1 to have a 
lower rating than exercise n, even though it includes the result of exercise n as 
a special case. 


Summary of codes: 

                                   00  Immediate
                                   10  Simple (one minute)
                                   20  Medium (quarter hour)
►   Recommended                    30  Moderately hard
M   Mathematically oriented        40  Term project
HM  Requiring “higher math”        50  Research problem


EXERCISES 

► 1. [00] What does the rating mean?

2. [10] Of what value can the exercises in a textbook be to the reader?

3. [M50] Prove that when n is an integer, n > 2, the equation $x^n + y^n = z^n$ has
no solution in positive integers x, y, z.


Exercise is the beste instrument in leernyng.
—ROBERT RECORDE (The Whetstone of Witte, 1557) 
CHAPTER THREE


RANDOM NUMBERS 


Anyone who considers arithmetical 
methods of producing random digits 
is, of course, in a state of sin. 

—JOHN VON NEUMANN (1951) 


Lest men suspect your tale untrue. 
Keep probability in view. 

—JOHN GAY (1727) 


There wanted not some beams of light 
to guide men in the exercise of their Stocastick faculty. 

—JOHN OWEN (1662) 


3.1. INTRODUCTION 

NUMBERS that are “chosen at random” are useful in many different kinds of 
applications. For example: 

a) Simulation. When a computer is being used to simulate natural phenomena, 
random numbers are required to make things realistic. Simulation covers many 
fields, from the study of nuclear physics (where particles are subject to random 
collisions) to operations research (where people come into, say, an airport at 
random intervals). 

b) Sampling. It is often impractical to examine all possible cases, but a random 
sample will provide insight into what constitutes “typical” behavior. 

c) Numerical analysis. Ingenious techniques for solving complicated numerical 
problems have been devised using random numbers. Several books have been 
written on this subject. 

d) Computer programming. Random values make a good source of data for 
testing the effectiveness of computer algorithms. This is the primary application 
of interest to us in this series of books; it accounts for the fact that random 
numbers are already being considered here in Chapter 3, before most of the 
other computer algorithms have appeared. 

Several people experimented with the middle-square method in the early
1950s. Working with numbers that have four digits instead of ten, G. E. Forsythe 
tried 16 different starting values and found that 12 of them led to sequences 
ending with the cycle 6100, 2100, 4100, 8100, 6100, ..., while two of them 
degenerated to zero. N. Metropolis also conducted extensive tests on the
middle-square method, mostly in the binary number system. He showed that when
20-bit numbers are being used, there are 13 different cycles into which the sequence
might degenerate, the longest of which has a period of length 142.
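
The cycle reported by Forsythe can be reproduced with a short experiment. The following Python sketch is only an illustration: the helper names `middle_square` and `find_cycle` are hypothetical, an even number of digits is assumed, and every value is recorded in a dictionary, which is simple but makes no attempt to save memory.

```python
def middle_square(x: int, digits: int) -> int:
    """Middle `digits` digits of x*x (digits assumed even)."""
    return (x * x // 10 ** (digits // 2)) % 10 ** digits


def find_cycle(start: int, digits: int) -> tuple[list[int], list[int]]:
    """Iterate the middle-square method from `start`.

    Returns (tail, cycle): the values seen before the sequence first
    enters its cycle, and the values of the cycle itself.
    """
    first_seen = {}   # value -> index of its first appearance
    sequence = []
    x = start
    while x not in first_seen:
        first_seen[x] = len(sequence)
        sequence.append(x)
        x = middle_square(x, digits)
    k = first_seen[x]  # index at which the repeated value first occurred
    return sequence[:k], sequence[k:]


tail, cycle = find_cycle(6100, 4)
print(tail, cycle)  # [] [6100, 2100, 4100, 8100]
```

Calling `find_cycle` on other four-digit starting values shows how quickly a given seed falls into a short cycle or degenerates to zero.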

It is fairly easy to restart the middle-square method on a new value when 
zero has been detected, but long cycles are somewhat harder to avoid. Exercise 7 
discusses some interesting ways to determine the cycles of periodic sequences, 
using very little memory space. 

A theoretical disadvantage of the middle-square method is given in exercises 
9 and 10. On the other hand, working with 38-bit numbers, Metropolis obtained a 
sequence of about 750,000 numbers before degeneracy occurred, and the resulting 
750,000 × 38 bits satisfactorily passed statistical tests for randomness. This
shows that the middle-square method can give usable results, but it is rather 
dangerous to put much faith in it until after elaborate computations have been 
performed. 

Many random number generators in use today are not very good. There is 
a tendency for people to avoid learning anything about such subroutines; quite 
often we find that some old method that is comparatively unsatisfactory has 
blindly been passed down from one programmer to another, and today’s users 
have no understanding of its limitations. We shall see in this chapter that it is 
not difficult to learn the most important facts about random number generators 
and their proper use. 

It is not easy to invent a foolproof source of random numbers. This fact was 
convincingly impressed upon the author several years ago, when he attempted 
to create a fantastically good generator using the following peculiar approach: 

Algorithm K (“Super-random” number generator). Given a 10-digit decimal
number X, this algorithm may be used to change X to the number that should 
come next in a supposedly random sequence. Although the algorithm might be 
expected to yield quite a random sequence, reasons given below show that it 
is, in fact, not very good at all. (The reader need not study this algorithm in 
great detail except to observe how complicated it is; note, in particular, steps 
K1 and K2.) 

K1. [Choose number of iterations.] Set $Y \leftarrow \lfloor X/10^9 \rfloor$, the most significant
digit of X. (We will execute steps K2 through K13 exactly $Y + 1$ times;
that is, we will apply randomizing transformations a random number of
times.)

K2. [Choose random step.] Set $Z \leftarrow \lfloor X/10^8 \rfloor \bmod 10$, the second most
significant digit of X. Go to step K(3 + Z). (That is, we now jump to a random
step in the program.)

K3. [Ensure $\ge 5 \times 10^9$.] If $X < 5000000000$, set $X \leftarrow X + 5000000000$.
device. A famous random-number machine called ERNIE has been used to pick
the winning numbers in the British Premium Savings Bonds lottery. [See the 
articles by Kendall and Babington-Smith in J. Royal Stat. Soc., Series A, 101 
(1938), 147-166, and Series B, 6 (1939), 51-61; see also the review of the RAND 
table in Math. Comp. 10 (1956), 39-43, and the discussion of ERNIE by W. E. 
Thomson et al., J. Royal Stat. Soc., Series A, 122 (1959), 301-333.] 

Shortly after computers were introduced, people began to search for efficient 
ways to obtain random numbers within computer programs. A table can be used, 
but this method is of limited utility because of the memory space and input time 
requirement, because the table may be too short, and because it is a bit of a 
nuisance to prepare and maintain the table. Machines such as ERNIE might 
be attached to the computer, but this would be unsatisfactory since it would be 
impractical to reproduce calculations exactly a second time when checking out 
a program; and such machines have tended to suffer from malfunctions that are 
difficult to detect. 

The inadequacy of these methods led to an interest in the production of 
random numbers using the arithmetic operations of a computer. John von 
Neumann first suggested this approach in about 1946, using the “middle-square” 
method. His idea was to take the square of the previous random number and 
to extract the middle digits; for example, if we are generating 10-digit numbers 
and the previous value was 5772156649, we square it to get 

33317792380594909201, 

and the next number is therefore 7923805949. 
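
As a minimal sketch of the step just described, the following Python function squares a value and extracts the middle digits. The function name `middle_square` and its `digits` parameter are illustrative choices, not notation from the text; the sketch assumes an even number of digits, with the square regarded as a number of twice that many digits, padded with leading zeros if necessary.

```python
def middle_square(x: int, digits: int = 10) -> int:
    """Return the middle `digits` digits of x*x, viewed as a 2*digits-digit number."""
    if digits % 2 != 0:
        raise ValueError("this sketch assumes an even number of digits")
    # Drop the low-order digits/2 digits, then keep the next `digits` digits.
    return (x * x // 10 ** (digits // 2)) % 10 ** digits


print(middle_square(5772156649))  # the example above; prints 7923805949
```

Python integers have unlimited precision, so the 20-digit square is computed exactly; on a fixed-word machine the same step needs a double-length product.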

There is a fairly obvious objection to this technique: how can a sequence 
generated in such a way be random, since each number is completely determined 
by its predecessor? The answer is that this sequence isn’t random, but it appears
to be. In typical applications the actual relationship between one number and 
its successor has no physical significance; hence the nonrandom character is 
not really undesirable. Intuitively, the middle square seems to be a fairly good 
scrambling of the previous number. 

Sequences generated in a deterministic way such as this are usually called 
pseudo-random or quasi-random sequences in the highbrow technical literature, 
but in this book we shall simply call them random sequences, with the
understanding that they only appear to be random. Being “apparently random” is
perhaps all that can be said about any random sequence anyway. Random
numbers generated deterministically on computers have worked quite well in
nearly every application, provided that a suitable method has been carefully
selected. Of course, deterministic sequences aren’t always the answer; they
certainly shouldn’t replace ERNIE for the lotteries.

Von Neumann’s original “middle-square method” has actually proved to be a 
comparatively poor source of random numbers. The danger is that the sequence 
tends to get into a rut, a short cycle of repeating elements. For example, if zero 
ever appears as a number of the sequence, it will continually perpetuate itself. 
e) Decision making. There are reports that many executives make their
decisions by flipping a coin or by throwing darts, etc. It is also rumored that some
college professors prepare their grades on such a basis. Sometimes it is
important to make a completely “unbiased” decision; this ability is occasionally useful
in computer algorithms, for example in situations where a fixed decision made 
each time would cause the algorithm to run more slowly. Randomness is also an 
essential part of optimal strategies in the theory of games. 

f) Recreation. Rolling dice, shuffling decks of cards, spinning roulette wheels, 
etc., are fascinating pastimes for just about everybody. These traditional uses 
of random numbers have suggested the name “Monte Carlo method,” a general 
term used to describe any algorithm that employs random numbers. 

People who think about this topic almost invariably get into philosophical 
discussions about what the word “random” means. In a sense, there is no such 
thing as a random number; for example, is 2 a random number? Rather, we speak 
of a sequence of independent random numbers with a specified distribution, and 
this means loosely that each number was obtained merely by chance, having 
nothing to do with other numbers of the sequence, and that each number has a 
specified probability of falling in any given range of values. 

A uniform distribution on a finite set of numbers is one in which each possible 
number is equally probable. A distribution is generally understood to be uniform 
unless some other distribution is specifically mentioned. 

Each of the ten digits 0 through 9 will occur about $\frac{1}{10}$ of the time in a
(uniform) sequence of random digits. Each pair of two successive digits should
occur about $\frac{1}{100}$ of the time, etc. Yet if we take a truly random sequence of a
million digits, it will not always have exactly 100,000 zeros, 100,000 ones, etc. In 
fact, chances of this are quite slim; a sequence of such sequences will have this 
character on the average. 
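
The point can be illustrated with a small counting experiment. This sketch is an assumption-laden stand-in: it uses Python's built-in pseudo-random generator in place of a truly random source, and the name `digit_counts` and the seed value are hypothetical choices made only so that the run is repeatable.

```python
import random


def digit_counts(n: int, seed: int = 0) -> list[int]:
    """Count how often each digit 0..9 occurs among n pseudo-random digits."""
    rng = random.Random(seed)   # seeded so that the experiment is repeatable
    counts = [0] * 10
    for _ in range(n):
        counts[rng.randrange(10)] += 1
    return counts


counts = digit_counts(1_000_000)
print(counts)                              # ten counts that sum to 1,000,000
print(all(c == 100_000 for c in counts))   # are all ten counts exactly 100,000?
```

Trying several seeds gives a feeling for how far the individual counts typically stray from 100,000.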

Any specified sequence of a million digits is equally as probable as the 
sequence consisting of a million zeros. Thus, if we are choosing a million digits at 
random and if the first 999,999 of them happen to come out to be zero, the chance 
that the final digit is zero is still exactly $\frac{1}{10}$ in a truly random situation. These
statements seem paradoxical to many people, but there is really no contradiction 
involved. 

There are several ways to formulate decent abstract definitions of
randomness, and we will return to this interesting subject in Section 3.5; but for the
moment, let us content ourselves with an intuitive understanding of the concept. 

At first, people who needed random numbers in their scientific work would 
draw balls out of a “well-stirred urn” or would roll dice or deal out cards. A 
table of over 40,000 random digits, “taken at random from census reports,” was 
published in 1927 by L. H. C. Tippett. Since then, a number of devices have 
been built to generate random numbers mechanically; the first such machine was 
used in 1939 by M. G. Kendall and B. Babington-Smith to produce a table of 
100,000 random digits, and in 1955 the RAND Corporation published a widely 
used table of a million random digits obtained with the help of another special 
K4. [Middle square.] Replace X by $\lfloor X^2/10^5 \rfloor \bmod 10^{10}$, i.e., by the middle of
the square of X.

K5. [Multiply.] Replace X by $(1001001001\,X) \bmod 10^{10}$.

K6. [Pseudo-complement.] If $X < 100000000$, then set $X \leftarrow X + 9814055677$;
otherwise set $X \leftarrow 10^{10} - X$.

K7. [Interchange halves.] Interchange the low-order five digits of X with the
high-order five digits, i.e., set $X \leftarrow 10^5 (X \bmod 10^5) + \lfloor X/10^5 \rfloor$, the middle 10
digits of $(10^{10} + 1)X$.

K8. [Multiply.] Same as step K5.

K9. [Decrease digits.] Decrease each nonzero digit of the decimal representation
of X by one.

K10. [99999 modify.] If $X < 10^5$, set $X \leftarrow X^2 + 99999$; otherwise set
$X \leftarrow X - 99999$.

K11. [Normalize.] (At this point X cannot be zero.) If $X < 10^9$, set $X \leftarrow 10X$
and repeat this step.

K12. [Modified middle square.] Replace X by $\lfloor X(X - 1)/10^5 \rfloor \bmod 10^{10}$, i.e., by
the middle 10 digits of $X(X - 1)$.

K13. [Repeat?] If $Y > 0$, decrease Y by 1 and return to step K2. If $Y = 0$, the
algorithm terminates with X as the desired “random” value. ▮

(The machine-language program corresponding to the above algorithm was
intended to be so complicated that a person reading a listing of it without
explanatory comments wouldn’t know what the program was doing.)

Considering all the contortions of Algorithm K, doesn’t it seem plausible
that it should produce almost an infinite supply of unbelievably random
numbers? No! In fact, when this algorithm was first put onto a computer, it almost
immediately converged to the 10-digit value 6065038420, which—by extraordinary
coincidence—is transformed into itself by the algorithm (see Table 1). With
another starting number, the sequence began to repeat after 7401 values, in a 
cyclic period of length 3178. 
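
The fixed point can be checked mechanically. The following Python sketch transcribes steps K1 through K13 as stated above; the function name `algorithm_k` and the range check are additions made for illustration, and Python's arbitrary-precision integers stand in for the decimal machine arithmetic.

```python
def algorithm_k(x: int) -> int:
    """Apply Algorithm K once to a 10-digit decimal number x."""
    if not 0 <= x < 10**10:
        raise ValueError("x must have at most 10 decimal digits")
    y = x // 10**9                       # K1: most significant digit
    while True:
        step = 3 + (x // 10**8) % 10     # K2: jump to step K(3 + Z)
        if step <= 3 and x < 5 * 10**9:  # K3: ensure X >= 5 * 10^9
            x += 5 * 10**9
        if step <= 4:                    # K4: middle square
            x = (x * x // 10**5) % 10**10
        if step <= 5:                    # K5: multiply
            x = (1001001001 * x) % 10**10
        if step <= 6:                    # K6: pseudo-complement
            x = x + 9814055677 if x < 10**8 else 10**10 - x
        if step <= 7:                    # K7: interchange halves
            x = 10**5 * (x % 10**5) + x // 10**5
        if step <= 8:                    # K8: same as K5
            x = (1001001001 * x) % 10**10
        if step <= 9:                    # K9: decrease each nonzero digit
            x = int("".join(str(int(d) - 1) if d != "0" else d
                            for d in str(x)))
        if step <= 10:                   # K10: 99999 modify
            x = x * x + 99999 if x < 10**5 else x - 99999
        if step <= 11:                   # K11: normalize (x is nonzero here)
            while x < 10**9:
                x *= 10
        x = (x * (x - 1) // 10**5) % 10**10   # K12: modified middle square
        if y == 0:                       # K13: terminate, or repeat from K2
            return x
        y -= 1


print(algorithm_k(6065038420))  # the text reports that this value maps to itself
```

Iterating `algorithm_k` from other starting values, while recording the values already seen, is one way to observe the eventual repetition described above.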

The moral of this story is that random numbers should not be generated 
with a method chosen at random. Some theory should be used. 

In this chapter, we shall consider random number generators that are
superior to the middle-square method and to Algorithm K; the corresponding
sequences are guaranteed to have certain desirable random properties, and no
degeneracy will occur. We shall explore the reasons for this random behavior
in some detail, and we shall also consider techniques for manipulating random
numbers. For example, one of our investigations will be the shuffling of a
simulated deck of cards within a computer program.

Section 3.6 summarizes this chapter and lists several bibliographic sources. 



Table 1

A COLOSSAL COINCIDENCE: THE NUMBER 6065038420
IS TRANSFORMED INTO ITSELF BY ALGORITHM K.

Step   X (after)
K1     6065038420
K3     6065038420
K4     6910360760
K5     8031120760
K6     1968879240
K7     7924019688
K8     9631707688
K9     8520606577
K10    8520506578
K11    8520506578
K12    0323372207   Y = 6
K6     9676627793
K7     2779396766
K8     4942162766
K9     3831051655
K10    3830951656
K11    3830951656
K12    1905867781   Y = 5
K12    3319967479   Y = 4
K6     6680032521
K7     3252166800
K8     2218966800
K9     1107855700
K10    1107755701
K11    1107755701
K12    1226919902   Y = 3
K5     0048821902
K6     9862877579
K7     7757998628
K8     2384626628
K9     1273515517
K10    1273415518
K11    1273415518
K12    5870802097   Y = 2
K11    5870802097
K12    3172562687   Y = 1
K4     1540029446
K5     7015475446
K6     2984524554
K7     2455429845
K8     2730274845
K9     1620163734
K10    1620063735
K11    1620063735
K12    6065038420   Y = 0


EXERCISES 

► 1. [20] Suppose that you wish to obtain a decimal digit at random, not using a 
computer. Which of the following methods would be suitable? 

a) Open a telephone directory to a random place (i.e., stick your finger in it
somewhere) and use the units digit of the first number found on the selected page.

b) Same as (a), but use the units digit of the page number. 

c) Roll a die that is in the shape of a regular icosahedron, whose twenty faces have 
been labeled with the digits 0,0,1,1,..., 9,9. Use the digit that appears on top, 
when the die comes to rest. (A felt table with a hard surface is recommended for 
rolling dice.) 

d) Expose a geiger counter to a source of radioactivity for one minute (shielding 
yourself) and use the units digit of the resulting count. Assume that the geiger 
counter displays the number of counts in decimal notation, and that the count is 
initially zero. 

e) Glance at your wristwatch; and if the position of the second-hand is between $6n$
and $6(n + 1)$ seconds, choose the digit n.

f) Ask a friend to think of a random digit, and use the digit he names. 

g) Ask an enemy to think of a random digit, and use the digit he names. 
h) Assume that 10 horses are entered in a race and that you know nothing whatever
about their qualifications. Assign to these horses the digits 0 to 9, in arbitrary 
fashion, and after the race use the winner’s digit. 

2. [ M22 ] In a random sequence of a million decimal digits, what is the probability 
that there are exactly 100,000 of each possible digit? 

3. [10] What number follows 1010101010 in the middle-square method? 

4. [10] Why can’t the value of X be zero when step K11 of Algorithm K is performed? 
What would be wrong with the algorithm if X could be zero? 

5. [15] Explain why, in any case, Algorithm K should not be expected to provide 
“infinitely many” random numbers, in the sense that (even if the coincidence given 
in Table 1 had not occurred) one knows in advance that any sequence generated by 
Algorithm K will eventually be periodic. 

► 6. [M21] Suppose that we want to generate a sequence of integers $X_0, X_1, X_2, \ldots$
in the range $0 \le X_n < m$. Let $f(x)$ be any function such that $0 \le x < m$ implies
$0 \le f(x) < m$. Consider a sequence formed by the rule $X_{n+1} = f(X_n)$. (Examples
are the middle-square method and Algorithm K.)

a) Show that the sequence is ultimately periodic, in the sense that there exist numbers
$\lambda$ and $\mu$ for which the values $X_0, X_1, \ldots, X_\mu, \ldots, X_{\mu+\lambda-1}$ are distinct, but
$X_{n+\lambda} = X_n$ when $n \ge \mu$. Find the maximum and minimum possible values of $\mu$
and $\lambda$.

b) (R. W. Floyd.) Show that there exists an $n > 0$ such that $X_n = X_{2n}$; and the
smallest such value of n lies in the range $\mu \le n \le \mu + \lambda$. Furthermore the value
of $X_n$ is unique in the sense that if $X_n = X_{2n}$ and $X_r = X_{2r}$, then $X_r = X_n$.

c) Use the idea of part (b) to design an algorithm that calculates $\mu$ and $\lambda$ for any
given function f and any given $X_0$, using only $O(\mu + \lambda)$ steps and only a bounded
number of memory locations.

► 7. [M21] (R. P. Brent, 1977.) Let $\ell(n)$ be the greatest power of 2 that is less than or
equal to n; thus, for example, $\ell(15) = 8$ and $\ell(\ell(n)) = \ell(n)$.

a) Show that, in terms of the notation in exercise 6, there exists an $n > 0$ such that
$X_n = X_{\ell(n)-1}$. Find a formula that expresses the least such n in terms of $\mu$ and
$\lambda$.

b) Apply this result to design an algorithm that can be used in conjunction with any
random number generator of the type $X_{n+1} = f(X_n)$, to prevent it from cycling
indefinitely. Your algorithm should calculate the period length $\lambda$, and it should
use only a small amount of memory space—you must not simply store all of the 
computed sequence values! 

8. [28] Make a complete examination of the middle-square method in the case of two- 
digit decimal numbers, (a) We might start the process out with any of the 100 possible 
values 00, 01, ..., 99. How many of these values lead ultimately to the repeating cycle 
00, 00, ... ? [Example: Starting with 43, we obtain the sequence 43, 84, 05, 02, 00, 00, 
00, ....] (b) How many possible final cycles are there? How long is the longest cycle? (c) 
What starting value or values will give the largest number of distinct elements before 
the sequence repeats? 

9. [M14] Prove that the middle-square method using 2n-digit numbers to the base b 
has the following disadvantage: If the sequence includes any number whose most 
significant n digits are zero, the succeeding numbers will get smaller and smaller until 
zero occurs repeatedly. 

10. [M16] Under the assumptions of the preceding exercise, what can you say about
the sequence of numbers following X if the least significant n digits of X are zero?
What if the least significant $n + 1$ digits are zero?

► 11. [M26] Consider sequences of random number generators having the form
described in exercise 6. If we choose $f(x)$ and $X_0$ at random, i.e., if we assume that
each of the $m^m$ possible functions $f(x)$ is equally probable and that each of the m
possible values of $X_0$ is equally probable, what is the probability that the sequence will
eventually degenerate into a cycle of length $\lambda = 1$? (Note: The assumptions of this
problem give a natural way to think of a “random” random number generator of this
type. A method such as Algorithm K may be expected to behave somewhat like the
generator considered here; the answer to this problem gives a measure of how “colossal”
the coincidence of Table 1 really is.)

► 12. [M31] Under the assumptions of the preceding exercise, what is the average length
of the final cycle? What is the average length of the sequence before it begins to cycle?
(In the notation of exercise 6, we wish to examine the average values of $\lambda$ and of $\mu + \lambda$.)

13. [M42] If $f(x)$ is chosen at random in the sense of exercise 11, what is the average
length of the longest cycle obtainable by varying the starting value $X_0$? (Note: We
have already considered the analogous problem in the case that $f(x)$ is a random
permutation; see exercise 1.3.3-23.)

14. [M38] If $f(x)$ is chosen at random in the sense of exercise 11, what is the average
number of distinct final cycles obtainable by varying the starting value? [Cf. exercise
8(b).]

15. [M15] If $f(x)$ is chosen at random in the sense of exercise 11, what is the
probability that none of the final cycles has length 1, regardless of the choice of $X_0$?

16. [15] A sequence generated as in exercise 6 must begin to repeat after at most m
values have been generated. Suppose we generalize the method so that $X_{n+1}$ depends
on $X_{n-1}$ as well as on $X_n$; formally, let $f(x, y)$ be a function such that $0 \le x, y < m$
implies $0 \le f(x, y) < m$. The sequence is constructed by selecting $X_0$ and $X_1$
arbitrarily, and then letting

$$X_{n+1} = f(X_n, X_{n-1}), \qquad \text{for } n > 0.$$

What is the maximum period conceivably attainable in this case?

17. [10] Generalize the situation in the previous exercise so that $X_{n+1}$ depends on
the preceding k values of the sequence.

18. [M20] Invent a method analogous to that of exercise 7 for finding cycles in the
general form of random number generator discussed in exercise 17.

19. [M48] Solve the problems of exercises 11 through 15 for the more general case that
$X_{n+1}$ depends on the preceding k values of the sequence; each of the $m^{m^k}$ functions
$f(x_1, \ldots, x_k)$ is to be considered equally probable. (Note: The number of functions
that yield the maximum period is analyzed in exercise 2.3.4.2-23.)


3.2. GENERATING UNIFORM RANDOM NUMBERS

In this section we shall consider methods for generating a sequence of random
fractions, i.e., random real numbers $U_n$, uniformly distributed between zero and
one. Since a computer can represent a real number with only finite accuracy, we
shall actually be generating integers $X_n$ between zero and some number m; the
fraction

$$U_n = X_n / m$$

will then lie between zero and one. Usually m is the word size of the computer,
so $X_n$ may be regarded (conservatively) as the integer contents of a computer
word with the radix point assumed at the extreme right, and $U_n$ may be regarded
(liberally) as the contents of the same word with the radix point assumed at the
extreme left.


3.2.1. The Linear Congruential Method 

By far the most popular random number generators in use today are special 
cases of the following scheme, introduced by D. H. Lehmer in 1949. [See Proc. 
2nd Symp. on Large-Scale Digital Calculating Machinery (Cambridge: Harvard 
University Press, 1951), 141-146.] We choose four “magic numbers”: 


    m,      the modulus;           $0 < m$.
    a,      the multiplier;        $0 \le a < m$.
    c,      the increment;         $0 \le c < m$.            (1)
    $X_0$,  the starting value;    $0 \le X_0 < m$.

The desired sequence of random numbers $\langle X_n \rangle$ is then obtained by setting

$$X_{n+1} = (aX_n + c) \bmod m, \qquad n \ge 0. \tag{2}$$



This is called a linear congruential sequence. Taking the remainder mod m is
somewhat like determining where a ball will land in a spinning roulette wheel.
For example, the sequence obtained when $m = 10$ and $X_0 = a = c = 7$ is

$$7,\ 6,\ 9,\ 0,\ 7,\ 6,\ 9,\ 0,\ \ldots. \tag{3}$$

As this example shows, the sequence is not always “random” for all choices of
m, a, c, and $X_0$; the principles of choosing the magic numbers appropriately will
be investigated carefully in later parts of this chapter.
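
Recurrence (2) can be sketched in a few lines of Python. The function name `lcg` and the use of a generator are illustrative choices, not notation from the text; the parameters shown are those of example (3).

```python
from itertools import islice

def lcg(m, a, c, x0):
    """Yield X_0, X_1, X_2, ... where X_{n+1} = (a*X_n + c) mod m."""
    x = x0
    while True:
        yield x
        x = (a * x + c) % m

# Parameters of example (3): m = 10 and X_0 = a = c = 7.
first_eight = list(islice(lcg(10, 7, 7, 7), 8))
print(first_eight)   # [7, 6, 9, 0, 7, 6, 9, 0]
```

Dividing each value by m would give the corresponding fractions $U_n = X_n/m$.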

Example (3) illustrates the fact that the congruential sequences always “get
into a loop”; i.e., there is ultimately a cycle of numbers that is repeated endlessly.
This property is common to all sequences having the general form $X_{n+1} = f(X_n)$;
see exercise 3.1-6. The repeating cycle is called the period; sequence (3)
has a period of length 4. A useful sequence will of course have a relatively long
period.
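
For small examples the period can be measured directly. The following Python sketch (the name `find_cycle` is hypothetical) simply remembers where each value first occurred, so its memory grows with the length of the sequence; it is a brute-force check, not the bounded-memory methods asked for in exercises 3.1-6 and 3.1-7.

```python
def find_cycle(f, x0):
    """Return (mu, lam): X_mu is the first value that recurs, lam is the period length."""
    first_seen = {}              # value -> index of its first occurrence
    x, n = x0, 0
    while x not in first_seen:
        first_seen[x] = n
        x = f(x)
        n += 1
    mu = first_seen[x]
    return mu, n - mu

# Sequence (3): the period has length 4 and starts immediately.
mu, lam = find_cycle(lambda x: (7 * x + 7) % 10, 7)
print(mu, lam)   # 0 4
```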

The special case $c = 0$ deserves explicit mention, since the number generation
process is a little faster when $c = 0$ than it is when $c \ne 0$. We shall see later
that the restriction $c = 0$ cuts down the length of the period of the sequence,
but it is still possible to make the period reasonably long. Lehmer’s original
generation method had $c = 0$, although he mentioned $c \ne 0$ as a possibility;
the idea of taking $c \ne 0$ to obtain longer periods is due to Thomson [Comp.
J. 1 (1958), 83, 86] and, independently, to Rotenberg [JACM 7 (1960), 75-77].
The terms multiplicative congruential method and mixed congruential method
are used by many authors to denote linear congruential sequences with $c = 0$
and $c \ne 0$, respectively.

The letters m, a, c, and $X_0$ will be used throughout this chapter in the sense
described above. Furthermore, we will find it useful to define

$$b = a - 1, \tag{4}$$

in order to simplify many of our formulas.

We can immediately reject the case $a = 1$, for this would mean that
$X_n = (X_0 + nc) \bmod m$, and the sequence would certainly not behave as a random
sequence. The case $a = 0$ is even worse. Hence for practical purposes we may
assume that

$$a \ge 2, \qquad b \ge 1. \tag{5}$$

Now we can prove a generalization of Eq. (2),

$$X_{n+k} = \bigl(a^k X_n + (a^k - 1)c/b\bigr) \bmod m, \qquad k \ge 0, \quad n \ge 0, \tag{6}$$

which expresses the (n+k)th term directly in terms of the nth term. (The special
case $n = 0$ in this equation is worthy of note.) It follows that the subsequence
consisting of every kth term of $\langle X_n \rangle$ is another linear congruential sequence,
having the multiplier $a^k \bmod m$ and the increment $((a^k - 1)c/b) \bmod m$.
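
Equation (6) can be checked numerically. In the Python sketch below the names `step` and `jump` are illustrative; note that $(a^k - 1)/b$ is an ordinary integer quotient (the sum $1 + a + \cdots + a^{k-1}$), not a modular inverse, so the sketch reduces $a^k$ modulo $bm$ before dividing exactly by b.

```python
def step(m, a, c, x):
    """One application of X_{n+1} = (a*X_n + c) mod m."""
    return (a * x + c) % m

def jump(m, a, c, x, k):
    """X_{n+k} computed from X_n by Eq. (6); assumes a >= 2, so b = a - 1 >= 1."""
    b = a - 1
    # (a^k - 1)/b = 1 + a + ... + a^(k-1) is an integer.  Reducing a^k modulo b*m
    # before the exact division leaves that sum reduced modulo m.
    s = ((pow(a, k, b * m) - 1) % (b * m)) // b
    return (pow(a, k, m) * x + s * c) % m

# Compare the closed form with k single steps for example (3).
x = 7
for k in range(12):
    assert jump(10, 7, 7, 7, k) == x
    x = step(10, 7, 7, x)
```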

An important corollary of (6) is that the general sequence defined by m, a,
c, and $X_0$ can be expressed very simply in terms of the special case where $c = 1$
and $X_0 = 0$. Let

$$Y_0 = 0, \qquad Y_{n+1} = (aY_n + 1) \bmod m. \tag{7}$$

According to Eq. (6) we will have $Y_k \equiv (a^k - 1)/b$ (modulo m), hence the general
sequence defined in (2) satisfies

$$X_n = (AY_n + X_0) \bmod m, \qquad \text{where } A = (X_0 b + c) \bmod m. \tag{8}$$
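
Relation (8) is easy to confirm by running both sequences side by side. The following Python sketch uses a hypothetical helper name, `check_eq8`, and arbitrary test parameters.

```python
def check_eq8(m, a, c, x0, steps=200):
    """Check X_n = (A*Y_n + X_0) mod m with A = (X_0*b + c) mod m, for n < steps."""
    b = a - 1
    A = (x0 * b + c) % m
    x, y = x0, 0                       # X_0 and Y_0 = 0
    for _ in range(steps):
        if x != (A * y + x0) % m:
            return False
        x = (a * x + c) % m            # general sequence, Eq. (2)
        y = (a * y + 1) % m            # special sequence, Eq. (7)
    return True

print(check_eq8(10, 7, 7, 7))          # True
```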


EXERCISES 

1. [10] Example (3) shows a situation in which $X_4 = X_0$, so the sequence begins
again from the beginning. Give an example of a linear congruential sequence with
$m = 10$ for which $X_0$ never appears again in the sequence.

► 2. [M20] Show that if a and m are relatively prime, the number $X_0$ will always
appear in the period.

3. [M10] If a and m are not relatively prime, explain why the sequence will be
somewhat handicapped and probably not very random; hence we will generally want
the multiplier a to be relatively prime to the modulus m.

4. [11] Prove Eq. (6).

5. [M20] Equation (6) holds for $k \ge 0$. If possible, give a formula that expresses
$X_{n+k}$ in terms of $X_n$ for negative values of k.


3.2.1.1. Choice of modulus. Our current goal is to find good values for the 
parameters that define a linear congruential sequence. Let us first consider the 
proper choice of the number m. We want m to be rather large, since the period 
cannot have more than m elements. (Even if a person wants to generate only 
random zeros and ones, he should not take $m = 2$, for then the sequence would at
best have the form ..., 0,1,0,1,0,1,...! Methods for modifying random numbers 
to get random zeros and ones are discussed in Section 3.4.) 

Another factor that influences our choice of m is speed of generation: We 
want to pick a value so that the computation of $(aX_n + c) \bmod m$ is fast.

Consider MIX as an example. We can compute $y \bmod m$ by putting y in
registers A and X and dividing by m; assuming that y and m are positive, we
see that $y \bmod m$ will then appear in register X. But division is a comparatively
slow operation, and it can be avoided if we take m to be a value that is especially 
convenient, such as the word size of our computer. 

Let w be the computer’s word size, namely, $2^e$ on an e-bit binary computer or
$10^e$ on an e-digit decimal machine. (In this book we shall often use the letter e to
denote an arbitrary integer exponent, instead of the base of natural logarithms,
hoping that the context will make our notation unambiguous. Physicists have
a similar problem when they use e for the charge on an electron.) The result
of an addition operation is usually given modulo w, except on ones’-complement
machines; and multiplication mod w is also quite simple, since the desired result
is the lower half of the product. Thus, the following program computes the
quantity $(aX + c) \bmod w$ efficiently:

    LDA  A       rA ← a.
    MUL  X       rAX ← (rA) · X.
    SLAX 5       rA ← rAX mod w.                    (1)
    ADD  C       rA ← (rA + c) mod w.  |

The result appears in register A. The overflow toggle might be on at the conclusion
of the above sequence of instructions, and if this is undesirable, the code
should be followed by, e.g., “JOV *+1” to turn it off.
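
The idea behind program (1) carries over to any binary machine. The Python sketch below assumes a 32-bit binary word (the constant `E` and the function name are illustrative assumptions; MIX itself is not tied to a particular radix) and mimics taking the lower half of the double-length product instead of dividing.

```python
E = 32
W = 1 << E                       # word size w = 2^e (assumed binary machine)

def next_mod_w(a, x, c):
    """(a*x + c) mod w without any division, for 0 <= a, x, c < w."""
    product = a * x              # double-length product, like rAX after MUL
    low = product & (W - 1)      # lower half of the product = product mod w
    return (low + c) & (W - 1)   # addition modulo w; the carry is discarded

print(next_mod_w(5, 7, 1))       # 36
```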

A clever technique that is less commonly known can be used to perform
computations modulo $(w + 1)$. For reasons to be explained later, we will generally
want $c = 0$ when $m = w + 1$, so we merely need to compute $(aX) \bmod (w + 1)$.
The following program does this:

    01  LDAN X        rA ← −X.
    02  MUL  A        rAX ← (rA) · a.
    03  STX  TEMP
    04  SUB  TEMP     rA ← rA − rX.
    05  JANN *+3      Exit if rA ≥ 0.                   (2)
    06  INCA 2        rA ← rA + 2.
    07  ADD  =w-1=    rA ← rA + w − 1.  (Cf. exercise 3.)  |

Register A now contains the value $(aX) \bmod (w + 1)$. Of course, this value might
lie anywhere between 0 and w, inclusive, so the reader may legitimately wonder
how we can represent so many values in the A-register! (The register obviously
cannot hold a number larger than $w - 1$.) The answer is that overflow will be on
after the above program if and only if the result equals w, assuming that overflow
was initially off. It is convenient simply to reject the value w if it appears in the
congruential sequence modulo $w + 1$; this will happen if lines 05 and 06 of (2)
are replaced by “JANN *+4; INCA 2; JAP *-5”.

To prove that code (2) actually does determine $(aX) \bmod (w + 1)$, note that
in line 04 we are subtracting the lower half of the product from the upper half.
No overflow can occur at this step; and if $aX = qw + r$, with $0 \le r < w$, we
will have the quantity $r - q$ in register A after line 04. Now

$$aX = q(w + 1) + (r - q),$$

and since $q < w$, we have $-w < r - q < w$; hence $(aX) \bmod (w + 1)$ equals
either $r - q$ or $r - q + (w + 1)$, depending on whether $r - q \ge 0$ or $r - q < 0$.
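
The argument just given translates directly into a short Python sketch. The function name is hypothetical, and Python's `divmod` stands in for splitting the double-length product into its upper half q and lower half r.

```python
def mul_mod_w_plus_1(a, x, w):
    """(a*x) mod (w + 1) for 0 <= a, x < w, using only the two halves of the product."""
    q, r = divmod(a * x, w)      # a*x = q*w + r with 0 <= r < w, and q < w
    t = r - q                    # a*x = q*(w + 1) + (r - q), so t is congruent to a*x
    return t if t >= 0 else t + (w + 1)

print(mul_mod_w_plus_1(7, 9, 16))   # 12, since 63 = 3*17 + 12
```

The result can equal w itself, which is the case the MIX program signals through the overflow toggle.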

A similar technique can be used to get the product of two numbers modulo
$(w - 1)$; see exercise 8.

In later sections we shall require a knowledge of the prime factors of m in 
order to choose the multiplier a correctly. Table 1 lists the complete factorization 
of w ± 1 into primes for nearly every known computer word size; the methods 
of Section 4.5.4 can be used to extend this table if desired. 

The reader may well ask why we bother to consider using $m = w \pm 1$, when
the choice $m = w$ is so manifestly convenient. The reason is that when $m = w$,
the right-hand digits of $X_n$ are much less random than the left-hand digits. If
d is a divisor of m, and if

$$Y_n = X_n \bmod d, \tag{3}$$

we can easily show that

$$Y_{n+1} = (aY_n + c) \bmod d. \tag{4}$$

(For, $X_{n+1} = aX_n + c - qm$ for some integer q, and taking both sides mod d
causes the quantity qm to drop out when d is a factor of m.)
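
Equation (4) can be confirmed experimentally for any divisor d of m. In the Python sketch below the helper name and the multiplier, increment, and starting value are arbitrary illustrative choices; the divisor 31 of $2^{35} - 1$ is taken from Table 1.

```python
def residues_follow_own_lcg(m, a, c, x0, d, steps=1000):
    """Check that Y_n = X_n mod d satisfies Y_{n+1} = (a*Y_n + c) mod d when d divides m."""
    if m % d != 0:
        raise ValueError("d must be a divisor of m")
    x, y = x0, x0 % d
    for _ in range(steps):
        x = (a * x + c) % m          # full sequence modulo m
        y = (a * y + c) % d          # small sequence modulo d, Eq. (4)
        if y != x % d:
            return False
    return True

print(residues_follow_own_lcg(2**35 - 1, 3141592653, 2718281829, 5772156649, 31))  # True
```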

Table 1
PRIME FACTORIZATIONS OF w ± 1

    2^e - 1                                  e    2^e + 1

    7·31·151                                15    3^2·11·331
    3·5·17·257                              16    65537
    131071                                  17    3·43691
    3^3·7·19·73                             18    5·13·37·109
    524287                                  19    3·174763
    3·5^2·11·31·41                          20    17·61681
    7^2·127·337                             21    3^2·43·5419
    3·23·89·683                             22    5·397·2113
    47·178481                               23    3·2796203
    3^2·5·7·13·17·241                       24    97·257·673
    31·601·1801                             25    3·11·251·4051
    3·2731·8191                             26    5·53·157·1613
    7·73·262657                             27    3^4·19·87211
    3·5·29·43·113·127                       28    17·15790321
    233·1103·2089                           29    3·59·3033169
    3^2·7·11·31·151·331                     30    5^2·13·41·61·1321
    2147483647                              31    3·715827883
    3·5·17·257·65537                        32    641·6700417
    7·23·89·599479                          33    3^2·67·683·20857
    3·43691·131071                          34    5·137·953·26317
    31·71·127·122921                        35    3·11·43·281·86171
    3^3·5·7·13·19·37·73·109                 36    17·241·433·38737
    223·616318177                           37    3·1777·25781083
    3·174763·524287                         38    5·229·457·525313
    7·79·8191·121369                        39    3^2·2731·22366891
    3·5^2·11·17·31·41·61681                 40    257·4278255361
    13367·164511353                         41    3·83·8831418697
    3^2·7^2·43·127·337·5419                 42    5·13·29·113·1429·14449
    431·9719·2099863                        43    3·2932031007403
    3·5·23·89·397·683·2113                  44    17·353·2931542417
    7·31·73·151·631·23311                   45    3^3·11·19·331·18837001
    3·47·178481·2796203                     46    5·277·1013·1657·30269
    2351·4513·13264529                      47    3·283·165768537521
    3^2·5·7·13·17·97·241·257·673            48    193·65537·22253377
    179951·3203431780337                    59    3·2833·37171·1824726041
    3^2·5^2·7·11·13·31·41·61·151·331·1321   60    17·241·61681·4562284561
    7^2·73·127·337·92737·649657             63    3^3·19·43·5419·77158673929
    3·5·17·257·641·65537·6700417            64    274177·67280421310721

    10^e - 1                                 e    10^e + 1

    3^3·7·11·13·37                           6    101·9901
    3^2·239·4649                             7    11·909091
    3^2·11·73·101·137                        8    17·5882353
    3^4·37·333667                            9    7·11·13·19·52579
    3^2·11·41·271·9091                      10    101·3541·27961
    3^2·21649·513239                        11    11^2·23·4093·8779
    3^3·7·11·13·37·101·9901                 12    73·137·99990001
    3^2·11·17·73·101·137·5882353            16    353·449·641·1409·69857

To illustrate the significance of Eq. (4), let us suppose, for example, that
we have a binary computer. If $m = w = 2^e$, the low-order four bits of $X_n$ are
the numbers $Y_n = X_n \bmod 2^4$. The gist of Eq. (4) is that the low-order four
bits of $\langle X_n \rangle$ form a congruential sequence that has a period of length 16 or less.
Similarly, the low-order five bits are periodic with a period of at most 32; and
the least significant bit of $X_n$ is either constant or strictly alternating.
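
The short periods of the low-order bits are easy to observe. The Python sketch below (the function name and the multiplier, increment, and seed are illustrative assumptions) runs a generator with $m = 2^{32}$ and reports how soon the low-order k bits start to repeat.

```python
def low_bits_period(e, a, c, x0, k):
    """Period of the low-order k bits of X_n for the generator with m = 2^e."""
    m, mask = 1 << e, (1 << k) - 1
    seen, x, n = {}, x0 % m, 0
    # By Eq. (4) the low-order bits depend only on the previous low-order bits,
    # so the first repeated value of X_n mod 2^k closes their cycle.
    while (x & mask) not in seen:
        seen[x & mask] = n
        x = (a * x + c) % m
        n += 1
    return n - seen[x & mask]

periods = [low_bits_period(32, 1664525, 1013904223, 12345, k) for k in (1, 4, 5)]
print(periods)   # [2, 16, 32]
```

With $c = 0$ and an odd starting value the lowest bit is constant instead of alternating; for instance `low_bits_period(32, 5, 0, 1, 1)` is 1.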

This situation does not occur when $m = w \pm 1$; in such a case, the low-order
bits of $X_n$ will behave just as randomly as the high-order bits do. If, for example,
$w = 2^{35}$ and $m = 2^{35} - 1$, the numbers of the sequence will not be very random
if we consider only their remainders mod 31, 71, 127, or 122921 (cf. Table 1); but
the low-order bit, which represents the numbers of the sequence taken mod 2,
would be satisfactorily random.

Another alternative is to let m be the largest prime number less than w. 
This prime may be found by using the techniques of Section 4.5.4, and a table 
of suitably large primes appears in that section. 

In most applications, the low-order bits are insignificant, and the choice 
m = w is quite satisfactory—provided that the programmer using the random 
numbers does so wisely. 

Our discussion so far has been based on a “signed magnitude” computer like
MIX. Similar ideas apply to machines that use complement notations, although
there are some instructive variations. For example, a DEC 20 computer has 36
bits with two’s complement arithmetic; when it computes the product of two
nonnegative integers, the lower half contains the least significant 35 bits with a
plus sign. On this machine we should therefore take $w = 2^{35}$, not $2^{36}$. The
32-bit two’s complement arithmetic on IBM System/370 computers is different:
the lower half of a product contains a full 32 bits. Some programmers have
felt that this is a disadvantage, since the lower half can be negative when the
operands are positive, and it is a nuisance to correct this; but actually it is a
distinct advantage from the standpoint of random number generation, since we
can take $m = 2^{32}$ instead of $2^{31}$ (see exercise 4).


EXERCISES 

1. [M12] In exercise 3.2.1-3 we concluded that the best congruential generators will
have the multiplier a relatively prime to m. Show that when $m = w$ in this case it
is possible to compute $(aX + c) \bmod w$ in just three MIX instructions, rather than the
four in (1), with the result appearing in register X.

2. [16] Write a MIX subroutine having the following characteristics: 

Calling sequence: JMP RANDM 

Entry conditions: Location XRAND contains an integer X. 

Exit conditions: X ← rA ← $(aX + c) \bmod w$, rX ← 0, overflow off.


(Thus a call on this subroutine will produce the next random number of a linear 
congruential sequence.) 

3. [20] How can the constant $(w - 1)$ be specified in general in the MIX assembly
language, regardless of the value of the byte size?

► 4. [21] Discuss the calculation of linear congruential sequences with $m = 2^{32}$ on
two’s-complement machines such as the System/370 series.

5. [20] Given that m is less than the word size, and that x, y are nonnegative integers
less than m, show that the difference $(x - y) \bmod m$ may be computed in just four
MIX instructions, without requiring any division. What is the best code for the sum
$(x + y) \bmod m$?

► 6. [20] The previous exercise suggests that subtraction mod m is easier to perform 
than addition mod m. Discuss sequences generated by the rule 

$$X_{n+1} = (aX_n - c) \bmod m.$$

Are these sequences essentially different from linear congruential sequences as defined 
in the text? Are they more suited to efficient computer calculation? 

7. [M24] What patterns can you spot in Table 1? 

► 8. [20] Write a MIX program analogous to (2) that computes $(aX) \bmod (w - 1)$. The
values 0 and $w - 1$ are to be treated as equivalent in the input and output of your
program.

9. [23] Write a MIX program analogous to the one in exercise 8, but it should compute
$(aX) \bmod (w - 2)$.


3.2.1.2. Choice of multiplier. In this section we shall show how to choose the 
multiplier a so as to give the period of maximum length. A long period is essential 
for any sequence that is to be used as a source of random numbers; indeed, we 
would hope that the period contains considerably more numbers than will ever 
be used in a single application. Therefore we shall concern ourselves in this 
section with the question of period length. The reader should keep in mind, 
however, that a long period is only one desirable criterion for the randomness of 
our sequence. For example, when $a = c = 1$, the sequence is simply
$X_{n+1} = (X_n + 1) \bmod m$, and this obviously has a period of length m, yet it is anything
but random. Other considerations affecting the choice of a multiplier will be 
given later in this chapter. 

Since only m different values are possible, the period surely cannot be longer 
than m. Can we achieve the maximum length, m? The example above shows 
that it is always possible, although the choice a = c = 1 does not yield a 
desirable sequence. Let us investigate all possible choices of a, c, and $X_0$ that
give a period of length m. It turns out that all such values of the parameters 
can be characterized very simply; when m is the product of distinct primes, only 
a = 1 will produce the full period, but when m is divisible by a high power 
of some prime there is considerable latitude in the choice of a. The following 
theorem makes it easy to tell if the maximum period is achieved. 

Theorem A. The linear congruential sequence defined by m, a, c, and $X_0$ has
period length m if and only if

i) c is relatively prime to m;

ii) $b = a - 1$ is a multiple of p, for every prime p dividing m;

iii) b is a multiple of 4, if m is a multiple of 4.
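
The three conditions are simple to test mechanically. The following Python sketch (all function names are hypothetical, and trial division is used only because the moduli are tiny) checks them and compares the verdict with a brute-force run of the generator for every small m, a, and c.

```python
from math import gcd

def prime_factors(n):
    """Set of primes dividing n (trial division; adequate for small n)."""
    primes, p = set(), 2
    while p * p <= n:
        while n % p == 0:
            primes.add(p)
            n //= p
        p += 1
    if n > 1:
        primes.add(n)
    return primes

def has_full_period(m, a, c):
    """Conditions (i), (ii), (iii) of Theorem A."""
    b = a - 1
    return (gcd(c, m) == 1
            and all(b % p == 0 for p in prime_factors(m))
            and (m % 4 != 0 or b % 4 == 0))

def brute_full_period(m, a, c):
    """True if the sequence from X_0 = 0 visits all m values and returns to 0."""
    x, seen = 0, set()
    for _ in range(m):
        seen.add(x)
        x = (a * x + c) % m
    return len(seen) == m and x == 0

for m in range(2, 25):
    for a in range(m):
        for c in range(m):
            assert has_full_period(m, a, c) == brute_full_period(m, a, c)
```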

The ideas used in the proof of this theorem go back at least a hundred 
years. The first proof of the theorem in this particular form was given by M. 
Greenberger in the special case $m = 2^e$ [see JACM 8 (1961), 383-389], and the
sufficiency of conditions (i), (ii), and (iii) in the general case was shown by Hull 
and Dobell [see SIAM Review 4 (1962), 230-254]. To prove the theorem we 
will first consider some auxiliary number-theoretic results that are of interest in 
themselves. 

Lemma P. Let p be a prime number, and let e be a positive integer, where
$p^e > 2$. If

$$x \equiv 1 \ (\text{modulo } p^e), \qquad x \not\equiv 1 \ (\text{modulo } p^{e+1}), \tag{1}$$

then

$$x^p \equiv 1 \ (\text{modulo } p^{e+1}), \qquad x^p \not\equiv 1 \ (\text{modulo } p^{e+2}). \tag{2}$$

Proof. We have $x = 1 + qp^e$ for some integer q that is not a multiple of p. By
the binomial formula

$$\begin{aligned}
x^p &= 1 + \binom{p}{1} q p^e + \cdots + \binom{p}{p-1} q^{p-1} p^{(p-1)e} + q^p p^{pe} \\
    &= 1 + q p^{e+1} \Bigl( 1 + \frac{1}{p}\binom{p}{2} q p^e + \frac{1}{p}\binom{p}{3} q^2 p^{2e} + \cdots + \frac{1}{p}\binom{p}{p} q^{p-1} p^{(p-1)e} \Bigr).
\end{aligned}$$

The quantity in parentheses is an integer, and, in fact, every term inside the
parentheses is a multiple of p except the first term. For if $1 < k < p$, the
binomial coefficient $\binom{p}{k}$ is divisible by p (cf. exercise 1.2.6-10), hence

$$\frac{1}{p}\binom{p}{k} q^{k-1} p^{(k-1)e}$$

is divisible by $p^{(k-1)e}$; and the last term is $q^{p-1} p^{(p-1)e-1}$, which is divisible by
p since $(p - 1)e > 1$ when $p^e > 2$. So $x^p \equiv 1 + qp^{e+1}$ (modulo $p^{e+2}$), and this
completes the proof. (Note: A generalization of this result appears in exercise
3.2.2-11(a).) |
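
A numerical spot check of Lemma P takes only a few lines of Python. The helper name is hypothetical; it returns `None` when x does not satisfy hypothesis (1), and otherwise reports whether conclusion (2) holds. The last line shows why the restriction $p^e > 2$ is needed.

```python
def lemma_p_conclusion(p, e, x):
    """None if x fails hypothesis (1); otherwise whether conclusion (2) holds."""
    if x % p**e != 1 or x % p**(e + 1) == 1:
        return None
    return pow(x, p, p**(e + 1)) == 1 and pow(x, p, p**(e + 2)) != 1

# Conclusion (2) holds whenever hypothesis (1) does, provided p^e > 2.
for p, e in [(2, 2), (2, 3), (3, 1), (3, 2), (5, 1), (7, 1)]:
    assert all(lemma_p_conclusion(p, e, x) in (None, True) for x in range(1, 3000))

# With p^e = 2 the lemma fails: x = 3 satisfies (1), but 3^2 = 9 is 1 modulo 8.
print(lemma_p_conclusion(2, 1, 3))   # False
```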

Lemma Q. Let the decomposition of m into prime factors be

$$m = p_1^{e_1} \ldots p_t^{e_t}. \tag{3}$$

The length $\lambda$ of the period of the linear congruential sequence determined by
$(X_0, a, c, m)$ is the least common multiple of the lengths $\lambda_j$ of the periods of the
linear congruential sequences $(X_0 \bmod p_j^{e_j},\ a \bmod p_j^{e_j},\ c \bmod p_j^{e_j},\ p_j^{e_j})$, $1 \le j \le t$.

Proof. By induction on t, it suffices to prove that if $m_1$ and $m_2$ are relatively
prime, the length $\lambda$ of the linear congruential sequence determined by the parameters
$(X_0, a, c, m_1 m_2)$ is the least common multiple of the lengths $\lambda_1$ and $\lambda_2$ of the
periods of the sequences determined by $(X_0 \bmod m_1,\ a \bmod m_1,\ c \bmod m_1,\ m_1)$
and $(X_0 \bmod m_2,\ a \bmod m_2,\ c \bmod m_2,\ m_2)$. We observed in the previous section,
Eq. (4), that if the elements of these three sequences are respectively denoted by
$X_n$, $Y_n$, and $Z_n$, we will have

$$Y_n = X_n \bmod m_1 \quad\text{and}\quad Z_n = X_n \bmod m_2, \qquad \text{for all } n \ge 0.$$

Therefore, by Law D of Section 1.2.4, we find that

$$X_n = X_k \quad\text{if and only if}\quad Y_n = Y_k \text{ and } Z_n = Z_k. \tag{4}$$

Let $\lambda'$ be the least common multiple of $\lambda_1$ and $\lambda_2$; we wish to prove that
$\lambda' = \lambda$. Since $X_n = X_{n+\lambda}$ for all suitably large n, we have $Y_n = Y_{n+\lambda}$ (hence
$\lambda$ is a multiple of $\lambda_1$) and $Z_n = Z_{n+\lambda}$ (hence $\lambda$ is a multiple of $\lambda_2$), so we must
have $\lambda \ge \lambda'$. Furthermore, we know that $Y_n = Y_{n+\lambda'}$ and $Z_n = Z_{n+\lambda'}$ for all
suitably large n; therefore, by (4), $X_n = X_{n+\lambda'}$. This proves $\lambda \le \lambda'$. |
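
Lemma Q can be checked by brute force for a pair of relatively prime moduli. In this Python sketch the function names are illustrative and the moduli 8 and 9 are an arbitrary choice.

```python
from math import gcd

def period(m, a, c, x0):
    """Length of the cycle that the sequence (X_0, a, c, m) eventually enters."""
    seen, x, n = {}, x0 % m, 0
    while x not in seen:
        seen[x] = n
        x = (a * x + c) % m
        n += 1
    return n - seen[x]

def lcm(u, v):
    return u * v // gcd(u, v)

m1, m2 = 8, 9                        # relatively prime moduli
for a in range(m1 * m2):
    for c in range(0, m1 * m2, 7):
        for x0 in (0, 5):
            combined = period(m1 * m2, a, c, x0)
            assert combined == lcm(period(m1, a, c, x0), period(m2, a, c, x0))
```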

Now we are ready to prove Theorem A. Because of Lemma Q, it suffices to
prove the theorem when m is a power of a prime number. For

$$p_1^{e_1} \ldots p_t^{e_t} = \lambda = \operatorname{lcm}(\lambda_1, \ldots, \lambda_t) \le \lambda_1 \ldots \lambda_t \le p_1^{e_1} \ldots p_t^{e_t}$$

can be true if and only if $\lambda_j = p_j^{e_j}$ for $1 \le j \le t$.

Therefore, assume that $m = p^e$, where p is prime and e is a positive integer.
The theorem is obviously true when $a = 1$, so we may take $a > 1$. The period
can be of length m if and only if each possible integer $0 \le x < m$ occurs in the
period, since no value occurs in the period more than once. Therefore the period
is of length m if and only if the period of the sequence with $X_0 = 0$ is of length
m, and we are justified in supposing that $X_0 = 0$. By formula 3.2.1-6 we have

$$X_n = \Bigl(\frac{a^n - 1}{a - 1}\Bigr) c \bmod m. \tag{5}$$

If c is not relatively prime to m, this value $X_n$ could never be equal to 1, so
condition (i) of the theorem is necessary. The period has length m if and only if
the smallest positive value of n for which $X_n = X_0 = 0$ is $n = m$. By (5) and
condition (i), our theorem now reduces to proving the following fact:

Lemma R. Assume that $1 < a < p^e$, where p is prime. If $\lambda$ is the smallest
positive integer for which $(a^\lambda - 1)/(a - 1) \equiv 0$ (modulo $p^e$), then

$$\lambda = p^e \quad\text{if and only if}\quad
\begin{cases}
a \equiv 1 \ (\text{modulo } p) & \text{when } p > 2, \\
a \equiv 1 \ (\text{modulo } 4) & \text{when } p = 2.
\end{cases}$$

Proof. Assume that $\lambda = p^e$. If $a \not\equiv 1$ (modulo p), then $(a^n - 1)/(a - 1) \equiv 0$
(modulo $p^e$) if and only if $a^n - 1 \equiv 0$ (modulo $p^e$). The condition $a^{p^e} - 1 \equiv 0$
(modulo $p^e$) then implies that $a^{p^e} \equiv 1$ (modulo p); but by Theorem 1.2.4F we
have $a^{p^e} \equiv a$ (modulo p), hence $a \not\equiv 1$ (modulo p) leads to a contradiction. And
if $p = 2$ and $a \equiv 3$ (modulo 4), we have $(a^{2^{e-1}} - 1)/(a - 1) \equiv 0$ (modulo $2^e$)
by exercise 8. These arguments show that it is necessary in general to have
$a = 1 + qp^f$, where $p^f > 2$ and q is not a multiple of p, whenever $\lambda = p^e$.

It remains to be shown that this condition is sufficient to make $\lambda = p^e$. By
repeated application of Lemma P, we find that

$$a^{p^g} \equiv 1 \ (\text{modulo } p^{f+g}), \qquad a^{p^g} \not\equiv 1 \ (\text{modulo } p^{f+g+1}),$$

for all $g \ge 0$, and therefore

$$\begin{aligned}
(a^{p^g} - 1)/(a - 1) &\equiv 0 \ (\text{modulo } p^g), \\
(a^{p^g} - 1)/(a - 1) &\not\equiv 0 \ (\text{modulo } p^{g+1}).
\end{aligned} \tag{6}$$

In particular, $(a^{p^e} - 1)/(a - 1) \equiv 0$ (modulo $p^e$). Now the congruential sequence
$(0, a, 1, p^e)$ has $X_n = (a^n - 1)/(a - 1) \bmod p^e$; therefore it has a period of length $\lambda$,
that is, $X_n = 0$ if and only if n is a multiple of $\lambda$. Hence $p^e$ is a multiple of $\lambda$.
This can happen only if $\lambda = p^g$ for some g, and the relations in (6) imply that
$\lambda = p^e$, completing the proof. |

The proof of Theorem A is now complete. | 

We will conclude this section by considering the special case of pure multiplicative
generators, when $c = 0$. Although the random number generation
process is slightly faster in this case, Theorem A shows us that the maximum
period length cannot be achieved. In fact, this is quite obvious, since the sequence
now satisfies the relation

$$X_{n+1} = aX_n \bmod m, \tag{7}$$

and the value $X_n = 0$ should never appear, lest the sequence degenerate to zero.
In general, if d is any divisor of m and if $X_n$ is a multiple of d, all succeeding
elements $X_{n+1}, X_{n+2}, \ldots$ of the multiplicative sequence will be multiples of d.
So when $c = 0$, we will want $X_n$ to be relatively prime to m for all n, and this
limits the length of the period to at most $\varphi(m)$, the number of integers between
0 and m that are relatively prime to m.

It may be possible to achieve an acceptably long period even if we stipulate
that $c = 0$. Let us now try to find conditions on the multiplier so that the period
is as long as possible in this special case.

According to Lemma Q, the period of the sequence depends entirely on the
periods of the sequences when $m = p^e$, so let us consider that situation. We
have $X_n = a^n X_0 \bmod p^e$, and it is clear that the period will be of length 1 if a is
a multiple of p, so we take a to be relatively prime to p. Then the period is the
smallest integer $\lambda$ such that $X_0 = a^\lambda X_0 \bmod p^e$. If the greatest common divisor
of $X_0$ and $p^e$ is $p^f$, this condition is equivalent to

$$a^\lambda \equiv 1 \ (\text{modulo } p^{e-f}). \tag{8}$$

By Euler’s theorem (exercise 1.2.4-28), $a^{\varphi(p^{e-f})} \equiv 1$ (modulo $p^{e-f}$); hence $\lambda$ is
a divisor of

$$\varphi(p^{e-f}) = p^{e-f-1}(p - 1).$$

When a is relatively prime to m, the smallest integer $\lambda$ for which $a^\lambda \equiv 1$
(modulo m) is conventionally called the order of a modulo m. Any such value
of a that has the maximum possible order modulo m is called a primitive element
modulo m.
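
For small moduli the order can be found by direct search. The Python sketch below is only illustrative (the function name `order` is an assumption, and the loop takes time proportional to the order itself).

```python
from math import gcd

def order(a, m):
    """Smallest lam >= 1 with a^lam congruent to 1 modulo m; needs gcd(a, m) = 1."""
    if gcd(a, m) != 1:
        raise ValueError("a must be relatively prime to m")
    lam, power = 1, a % m
    while power != 1 % m:
        power = power * a % m
        lam += 1
    return lam

print(order(3, 7), order(2, 7))   # 6 3
```

Here 3 is a primitive element modulo 7, since no element can have order exceeding 6, while 2 is not.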

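Let $\lambda(m)$ denote the order of a primitive element, i.e., the maximum possible
order, modulo m. The remarks above show that $\lambda(p^e)$ is a divisor of $p^{e-1}(p - 1)$;
with a little care (see exercises 11 through 16 below) we can give the precise value
of $\lambda(m)$ in all cases as follows:

$$\begin{aligned}
&\lambda(2) = 1, \qquad \lambda(4) = 2, \qquad \lambda(2^e) = 2^{e-2} \quad\text{if } e \ge 3; \\
&\lambda(p^e) = p^{e-1}(p - 1), \quad\text{if } p > 2; \\
&\lambda(p_1^{e_1} \ldots p_t^{e_t}) = \operatorname{lcm}\bigl(\lambda(p_1^{e_1}), \ldots, \lambda(p_t^{e_t})\bigr).
\end{aligned} \tag{9}$$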

Our remarks may be summarized in the following theorem: 

Theorem B. [R. D. Carmichael, Bull. Amer. Math. Soc. 16 (1910), 232-238.]
The maximum period possible when $c = 0$ is $\lambda(m)$, where $\lambda(m)$ is defined in (9).
This period is achieved if

i) $X_0$ is relatively prime to m;

ii) a is a primitive element modulo m. |

Note that we can obtain a period of length m — 1 if m is prime; this is just one 
less than the maximum length, so for all practical purposes such a period is as 
long as we want. 
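
The definition (9) and the claim of Theorem B can be checked by brute force for small moduli. The following is only a sketch; the function names are our own, the factorization uses plain trial division, and the period routine assumes that a is relatively prime to m (otherwise the sequence need not return to its starting value).

```python
from math import gcd


def factorize(m):
    """Return the prime factorization of m as {prime: exponent} (trial division)."""
    factors = {}
    p = 2
    while p * p <= m:
        while m % p == 0:
            factors[p] = factors.get(p, 0) + 1
            m //= p
        p += 1
    if m > 1:
        factors[m] = factors.get(m, 0) + 1
    return factors


def lcm(x, y):
    return x // gcd(x, y) * y


def carmichael_lambda(m):
    """lambda(m) as given in (9): the maximum order of any element modulo m."""
    result = 1
    for p, e in factorize(m).items():
        if p == 2:
            lam = 1 if e == 1 else 2 if e == 2 else 2 ** (e - 2)
        else:
            lam = p ** (e - 1) * (p - 1)
        result = lcm(result, lam)
    return result


def multiplicative_period(a, x0, m):
    """Period of X_{n+1} = a*X_n mod m by direct iteration; assumes gcd(a, m) == 1."""
    x, n = (a * x0) % m, 1
    while x != x0:
        x = (a * x) % m
        n += 1
    return n
```

For instance, `carmichael_lambda(2**10)` is 256, and the multiplier a = 5 with $X_0 = 1$ attains that period when m = 2^10.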

The question now is, how can we find primitive elements modulo m? The 
exercises at the close of this section tell us that there is a fairly simple answer 
when m is prime or a power of a prime, namely the results stated in our next 
theorem. 

Theorem C. The number a is a primitive element modulo $p^e$ if and only if

i) $p^e = 2$, a is odd; or $p^e = 4$, a mod 4 = 3;
   or $p^e = 8$, a mod 8 = 3, 5, 7; or $p = 2$, $e \ge 4$, a mod 8 = 3, 5;

or

ii) p is odd, e = 1, $a \not\equiv 0$ (modulo p), and $a^{(p-1)/q} \not\equiv 1$ (modulo p)
    for any prime divisor q of p − 1;

or

iii) p is odd, e > 1, a satisfies (ii), and $a^{p-1} \not\equiv 1$ (modulo $p^2$). |

Conditions (ii) and (iii) of this theorem are readily tested on a computer for
large values of p, by using the efficient methods for evaluating powers discussed
in Section 4.6.3. Theorem C applies to powers of primes only; if we are given
values $a_j$ that are primitive modulo $p_j^{e_j}$, it is possible to find a single value a
such that $a \equiv a_j$ (modulo $p_j^{e_j}$), for $1 \le j \le t$, using the “Chinese remainder
algorithm” discussed in Section 4.3.2, and this number a will be a primitive
element modulo $p_1^{e_1} \cdots p_t^{e_t}$. Hence there is a reasonably efficient way to construct
multipliers satisfying the condition of Theorem B, for any desired value of m,
although the calculations can be somewhat lengthy in the general case.
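
The conditions of Theorem C translate almost line for line into a program. The sketch below is illustrative only: the names are ours, the prime divisors of p − 1 are found by trial division (adequate only for small p), and Python's built-in three-argument `pow` stands in for the power-evaluation methods of Section 4.6.3. The brute-force `order` function is included only so that the test can be compared against the definition.

```python
from math import gcd


def prime_divisors(n):
    """Distinct prime divisors of n, by trial division."""
    primes, p = [], 2
    while p * p <= n:
        if n % p == 0:
            primes.append(p)
            while n % p == 0:
                n //= p
        p += 1
    if n > 1:
        primes.append(n)
    return primes


def is_primitive(a, p, e):
    """Test whether a is a primitive element modulo p**e, following Theorem C."""
    if p == 2:
        if e == 1:
            return a % 2 == 1
        if e == 2:
            return a % 4 == 3
        if e == 3:
            return a % 8 in (3, 5, 7)
        return a % 8 in (3, 5)
    if a % p == 0:
        return False
    # Condition (ii): a^((p-1)/q) must differ from 1 for every prime q dividing p-1.
    if any(pow(a, (p - 1) // q, p) == 1 for q in prime_divisors(p - 1)):
        return False
    # Condition (iii): needed only when e > 1.
    return e == 1 or pow(a, p - 1, p * p) != 1


def order(a, m):
    """Order of a modulo m by direct iteration; assumes gcd(a, m) == 1 and m > 1."""
    x, n = a % m, 1
    while x != 1:
        x = (x * a) % m
        n += 1
    return n
```

For example, 2 is primitive modulo 9, but 8 is not, since 8 satisfies (ii) for p = 3 while 8^2 ≡ 1 (modulo 9).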

In the common case $m = 2^e$, with $e \ge 4$, the conditions above simplify to
the single requirement that $a \equiv 3$ or 5 (modulo 8). In this case, one-fourth of
all possible multipliers give the maximum period.

The second most common case is when $m = 10^e$. Using Lemmas P and Q, it
is not difficult to obtain necessary and sufficient conditions for the achievement
of the maximum period in the case of a decimal computer (cf. exercise 18):

Theorem D. If $m = 10^e$, $e \ge 5$, c = 0, and $X_0$ is not a multiple of 2 or 5, the
period of the linear congruential sequence is $5 \times 10^{e-2}$ if and only if a mod 200
equals one of the following 32 values:

3, 11, 13, 19, 21, 27, 29, 37, 53, 59, 61, 67, 69, 77, 83, 91, 109, 117,
123, 131, 133, 139, 141, 147, 163, 171, 173, 179, 181, 187, 189, 197. |
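
Theorem D can be spot-checked for the smallest admissible exponent, e = 5, by computing orders directly. The sketch below is our own illustration, not a proof: it examines only multipliers below 200 (one representative of each residue class modulo 200 that is relatively prime to 10) and uses the fact that, when $X_0$ is relatively prime to m, the period equals the order of a modulo m.

```python
from math import gcd

# The 32 residues listed in Theorem D.
THEOREM_D_RESIDUES = {
    3, 11, 13, 19, 21, 27, 29, 37, 53, 59, 61, 67, 69, 77, 83, 91, 109, 117,
    123, 131, 133, 139, 141, 147, 163, 171, 173, 179, 181, 187, 189, 197,
}


def order(a, m):
    """Order of a modulo m by direct iteration; assumes gcd(a, m) == 1."""
    x, n = a % m, 1
    while x != 1:
        x = (x * a) % m
        n += 1
    return n


def maximal_multipliers(e, limit=200):
    """Multipliers a < limit whose order modulo 10**e is 5 * 10**(e-2)."""
    m = 10 ** e
    target = 5 * 10 ** (e - 2)
    return {a for a in range(1, limit)
            if gcd(a, 10) == 1 and order(a, m) == target}
```

With e = 5 the set returned by `maximal_multipliers(5)` coincides with the 32 listed residues.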


EXERCISES 

1. [10] What is the length of the period of the linear congruential sequence with
$X_0 = 5772156648$, a = 3141592621, c = 2718281829, and m = 10000000000?

2. [10] Are the following two conditions sufficient to guarantee the maximum length 
period, when m is a power of 2? “(i) c is odd; (ii) a mod 4 = 1.” 

3. [45] Suppose that $m = 10^e$, where $e \ge 2$, and suppose further that c is odd and
not a multiple of 5. Show that the linear congruential sequence will have the maximum
length period if and only if a mod 20 = 1.

4. [M20] When a and c satisfy the conditions of Theorem A, and when $m = 2^e$,
$X_0 = 0$, what is the value of $X_{2^{e-1}}$?

5. [14] Find all multipliers a that satisfy the conditions of Theorem A when
$m = 2^{35} + 1$. (The prime factors of m may be found in Table 3.2.1.1-1.)

► 6. [20] Find all multipliers a that satisfy the conditions of Theorem A when
$m = 10^6 - 1$. (See Table 3.2.1.1-1.)

► 7. [M23] The period of a congruential sequence need not start with $X_0$, but we can
always find indices $\mu \ge 0$ and $\lambda > 0$ such that $X_{n+\lambda} = X_n$ whenever $n \ge \mu$, and for
which $\mu$ and $\lambda$ are the smallest possible values with this property. (Cf. exercises 3.1-6
and 3.2.1-1.) If $\mu_j$ and $\lambda_j$ are the indices corresponding to the sequences
$(X_0 \bmod p_j^{e_j},\ a \bmod p_j^{e_j},\ c \bmod p_j^{e_j},\ p_j^{e_j})$, and if $\mu$ and $\lambda$ correspond to the
sequence $(X_0, a, c, p_1^{e_1} \cdots p_t^{e_t})$,
Lemma Q states that $\lambda$ is the least common multiple of $\lambda_1, \ldots, \lambda_t$. What is the value
of $\mu$ in terms of the values of $\mu_1, \ldots, \mu_t$? What is the maximum possible value of $\mu$
obtainable by varying $X_0$, a, and c, when $m = p_1^{e_1} \cdots p_t^{e_t}$ is fixed?

8. [M20] Show that if a mod 4 = 3, we have $(a^{2^{e-1}} - 1)/(a - 1) \equiv 0$ (modulo $2^e$)
when e > 1. (Use Lemma P.)

► 9. [M22] (W. E. Thomson.) When c = 0 and $m = 2^e \ge 16$, Theorems B and C say
that the period has length $2^{e-2}$ if and only if the multiplier a satisfies a mod 8 = 3
or a mod 8 = 5. Show that every such sequence is essentially a linear congruential
sequence with $m = 2^{e-2}$, having full period, in the following sense:

a) If $X_{n+1} = (4c + 1)X_n \bmod 2^e$, and $X_n = 4Y_n + 1$, then

    $Y_{n+1} = ((4c + 1)Y_n + c) \bmod 2^{e-2}$.

b) If $X_{n+1} = (4c - 1)X_n \bmod 2^e$, and $X_n = ((-1)^n (4Y_n + 1)) \bmod 2^e$, then

    $Y_{n+1} = ((1 - 4c)Y_n - c) \bmod 2^{e-2}$.

[Note: In these formulas, c is an odd integer. The literature contains several 
statements to the effect that sequences with c = 0 satisfying Theorem B are somehow 
more random than sequences satisfying Theorem A, in spite of the fact that the period is 
only one-fourth as long in the case of Theorem B. This exercise refutes such statements; 
in essence, one gives up two bits of the word length in order to save the addition of c, 
when m is a power of 2.] 

10. [M21] For what values of m is $\lambda(m) = \varphi(m)$?

► 11. [M28] Let x be an odd integer greater than 1. (a) Show that there exists a unique
integer $f > 1$ such that $x \equiv 2^f \pm 1$ (modulo $2^{f+1}$). (b) Given that $1 < x < 2^e - 1$
and that f is the corresponding integer from part (a), show that the order of x modulo
$2^e$ is $2^{e-f}$. (c) In particular, this proves Theorem C(i).

12. [M26] Let p be an odd prime. If e > 1, prove that a is a primitive element
modulo $p^e$ if and only if a is a primitive element modulo p and $a^{p-1} \not\equiv 1$ (modulo $p^2$).
(For the purposes of this exercise, assume that $\lambda(p^e) = p^{e-1}(p - 1)$. This fact is proved
in exercises 14 and 16 below.)

13. [M22] Let p be prime. Given that a is not a primitive element modulo p, show
that either a is a multiple of p or $a^{(p-1)/q} \equiv 1$ (modulo p) for some prime number q
that divides p − 1.

14. [M18] If e > 1 and p is an odd prime, and if a is a primitive element modulo p,
prove that either a or a + p is a primitive element modulo $p^e$. [Hint: See exercise 12.]

15. [M29] (a) Let $a_1$, $a_2$ be relatively prime to m, and let their orders modulo m be
$\lambda_1$, $\lambda_2$, respectively. If $\lambda$ is the least common multiple of $\lambda_1$ and $\lambda_2$, prove that
$a_1^{k_1} a_2^{k_2}$ has order $\lambda$ modulo m, for suitable integers $k_1$, $k_2$. [Hint: Consider first the
case that $\lambda_1$ is relatively prime to $\lambda_2$.] (b) Let $\lambda(m)$ be the maximum order of any element
modulo m. Prove that $\lambda(m)$ is a multiple of the order of each element modulo m; that
is, prove that $a^{\lambda(m)} \equiv 1$ (modulo m) whenever a is relatively prime to m.

► 16. [M24] Let p be a prime number. (a) Let $f(x) = x^n + c_1 x^{n-1} + \cdots + c_n$, where
the c’s are integers. Given that a is an integer for which $f(a) \equiv 0$ (modulo p), show
that there exists a polynomial $q(x) = x^{n-1} + q_1 x^{n-2} + \cdots + q_{n-1}$ with integer
coefficients such that $f(x) \equiv (x - a)q(x)$ (modulo p) for all integers x. (b) Let f(x) be
a polynomial as in (a). Show that f(x) has at most n distinct “roots” modulo p; that
is, there are at most n integers a, with $0 \le a < p$, such that $f(a) \equiv 0$ (modulo p). (c)
Because of exercise 15(b), the polynomial $f(x) = x^{\lambda(p)} - 1$ has p − 1 distinct roots;
hence there is an integer a with order p − 1.

17. [M26] Not all of the values listed in Theorem D would be found by the text’s
construction; for example, 11 is not primitive modulo $5^e$. How can this be possible,
when 11 is primitive modulo $10^e$, according to Theorem D? Which of the values listed
in Theorem D are primitive elements modulo both $2^e$ and $5^e$?

18. [M25] Prove Theorem D. (Cf. the previous exercise.) 

19. [40] Make a table of some suitable multipliers, a, for each of the values of m listed 
in Table 3.2.1.1-1, assuming that c = 0. 

► 20. [M24] (G. Marsaglia.) The purpose of this exercise is to study the period length
of an arbitrary linear congruential sequence. Let $Y_n = 1 + a + \cdots + a^{n-1}$, so that
$X_n = (AY_n + X_0) \bmod m$ for some constant A by Eq. 3.2.1-8. (a) Prove that the
period length of $\langle X_n \rangle$ is the period length of $\langle Y_n \bmod m' \rangle$, where $m' = m/\gcd(A, m)$.
(b) Prove that the period length of $\langle Y_n \bmod p^e \rangle$ satisfies the following when p is prime:
(i) If a mod p = 0, it is 1. (ii) If a mod p = 1, it is $p^e$, except when p = 2 and $e \ge 2$ and
a mod 4 = 3. (iii) If p = 2, $e \ge 2$, and a mod 4 = 3, it is twice the order of a modulo $p^e$
(cf. exercise 11), unless $a \equiv -1$ (modulo $2^e$) when it is 2. (iv) If a mod p > 1, it is the
order of a modulo $p^e$.

21. [M25] In a linear congruential sequence of maximum period, let $X_0 = 0$ and let s
be the least positive integer such that $a^s \equiv 1$ (modulo m). Prove that $\gcd(X_s, m) = s$.


3.2.1.3. Potency. In the preceding section, we showed that the maximum period
can be obtained when b = a − 1 is a multiple of each prime dividing m; and
b must also be a multiple of 4 if m is a multiple of 4. If z is the radix of the
machine being used (so that z = 2 for a binary computer, and z = 10 for a
decimal computer) and if m is the word size $z^e$, the multiplier

    $a = z^k + 1$, $2 \le k < e$    (1)

satisfies these conditions. Theorem 3.2.1.2A also says that we may take c = 1.
The recurrence relation now has the form

    $X_{n+1} = ((z^k + 1)X_n + 1) \bmod z^e$,    (2)
and this equation suggests that we can avoid the multiplication; merely shifting
and adding will suffice.

For example, suppose that $a = B^2 + 1$, where B is the byte size of MIX.
The code

        LDA   X
        SLA   2
        ADD   X                (3)
        INCA  1

can be used in place of the instructions given in Section 3.2.1.1, and the execution
time decreases from 16u to 7u.
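
The same shift-and-add idea can be sketched for a binary machine (z = 2). The function names below are our own, and the word size e and shift count k are free parameters of the illustration rather than values taken from the text; the second function performs the explicit multiplication of recurrence (2) so that the two can be compared.

```python
def next_shift_add(x, k, e):
    """One step of X <- ((2^k + 1) X + 1) mod 2^e using only a shift and additions."""
    mask = (1 << e) - 1
    return ((x << k) + x + 1) & mask


def next_multiply(x, k, e):
    """The same step computed with an explicit multiplication, as in (2) with z = 2."""
    return (((1 << k) + 1) * x + 1) % (1 << e)


def period_from_zero(k, e):
    """Number of steps before the sequence started at X_0 = 0 returns to 0."""
    x, n = next_shift_add(0, k, e), 1
    while x != 0:
        x = next_shift_add(x, k, e)
        n += 1
    return n
```

With e = 10 and k = 3 the two step functions agree on every input, and the sequence has the full period 2^10, as Theorem 3.2.1.2A predicts for a = 9 and c = 1.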

For this reason, multipliers having form (1) have been widely discussed in the 
literature, and indeed they have been recommended by many authors. However, 
the early years of experimentation with this method showed that multipliers 
having the simple form in (1) should be avoided. The generated numbers just 
aren’t random enough. 

Later in this chapter we shall be discussing some rather sophisticated theory 
that accounts for the badness of all the linear congruential random number 
generators known to be bad. However, some generators (such as (2)) are
sufficiently awful that a comparatively simple theory can be used to dispense with
them. This simple theory is related to the concept of “potency,” which we shall 
now discuss. 

The potency of a linear congruential sequence with maximum period is
defined to be the least integer s such that

    $b^s \equiv 0$ (modulo m).    (4)

(Such an integer s will always exist when the multiplier satisfies the conditions
of Theorem 3.2.1.2A, since b is a multiple of every prime dividing m.)

We may analyze the randomness of the sequence by taking $X_0 = 0$, since 0
occurs somewhere in the period. With this assumption, we have

    $X_n = ((a^n - 1)c/b) \bmod m$,

and if we expand $a^n - 1 = (b + 1)^n - 1$ by the binomial theorem, we find that

    $X_n = c\left(n + \binom{n}{2} b + \cdots + \binom{n}{s} b^{s-1}\right) \bmod m$.    (5)

All terms in $b^s$, $b^{s+1}$, etc., may be ignored, since they are multiples of m.

Equation (5) can be instructive, so we shall consider some special cases.
If a = 1, the potency is 1; and $X_n \equiv cn$ (modulo m), as we have already
observed, so the sequence is surely not random. If the potency is 2, we have
$X_n \equiv cn + cb\binom{n}{2}$, and again the sequence is not very random; indeed,

    $X_{n+1} - X_n \equiv c + cbn$
in this case, so the differences between consecutively generated numbers change
in a simple way from one value of n to the next. The point $(X_n, X_{n+1}, X_{n+2})$
always lies on one of the four planes

    $x - 2y + z = d + m$,    $x - 2y + z = d - m$,

    $x - 2y + z = d$,        $x - 2y + z = d - 2m$,

in three-dimensional space, where $d = cb \bmod m$.
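
The four-plane property is easy to observe numerically. In the sketch below (our own illustration; the parameter values used afterwards are arbitrary small choices, not taken from the text) the function walks through one full period and records, for each consecutive triple, which plane $x - 2y + z = d + jm$ it lies on; it returns None if some triple lies on no such plane.

```python
def lcg_sequence(x0, a, c, m, count):
    """First `count` terms of X_{n+1} = (a X_n + c) mod m, starting with X_0 = x0."""
    xs = [x0]
    for _ in range(count - 1):
        xs.append((a * xs[-1] + c) % m)
    return xs


def plane_indices(a, c, m, x0=0):
    """Set of j such that some triple satisfies x - 2y + z = d + j*m, d = c*b mod m.

    Returns None if any consecutive triple fails to lie on a plane of this family.
    """
    b = a - 1
    d = (c * b) % m
    xs = lcg_sequence(x0, a, c, m, m + 2)   # one full period plus two extra terms
    indices = set()
    for x, y, z in zip(xs, xs[1:], xs[2:]):
        diff = x - 2 * y + z - d
        if diff % m != 0:
            return None
        indices.add(diff // m)
    return indices
```

With m = 2^10, a = 2^5 + 1 and c = 7 the potency is 2, and every index found lies in {−2, −1, 0, 1}. With a = 2^4 + 1 the potency is 3 and the function returns None, because $x - 2y + z \equiv b^2 X_n + cb$ then depends on $X_n$.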

If the potency is 3, the sequence begins to look somewhat more random, 
but there is a high degree of dependency between $X_n$, $X_{n+1}$, and $X_{n+2}$; tests
show that sequences with potency 3 are still not sufficiently good. Reasonable 
results have been reported when the potency is 4 or more, but these have been 
disputed by other people. A potency of at least 5 would seem to be required for 
sufficiently random values. 

Suppose, for example, that $m = 2^{35}$ and $a = 2^k + 1$. Then $b = 2^k$, so
we find that when $k \ge 18$, the value $b^2 = 2^{2k}$ is a multiple of m: the potency
is 2. If k = 17, 16, ..., 12, the potency is 3, and a potency of 4 is achieved for
k = 11, 10, 9. The only acceptable multipliers, from the standpoint of potency,
therefore have $k \le 8$. This means $a \le 257$, and we shall see later that small
multipliers are also to be avoided. We have now eliminated all multipliers of the
form $2^k + 1$ when $m = 2^{35}$.
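
Definition (4) can be evaluated directly, which makes it easy to reproduce the figures just quoted. The sketch below uses a function name of our own; it returns None when no power of b vanishes modulo m, that is, when b is not a multiple of every prime dividing m.

```python
def potency(a, m):
    """Least s with (a-1)^s congruent to 0 modulo m, or None if no such s exists."""
    b = a - 1
    # If a vanishing power exists, its exponent cannot exceed log2(m).
    for s in range(1, m.bit_length() + 1):
        if pow(b, s, m) == 0:
            return s
    return None


if __name__ == "__main__":
    m = 2 ** 35
    for k in (18, 17, 12, 11, 9, 8):
        print(k, potency(2 ** k + 1, m))
```

For $m = 2^{35}$ the potency of $a = 2^k + 1$ is simply the least s with $ks \ge 35$.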

When m is equal to $w \pm 1$, where w is the word size, m is generally
not divisible by high powers of primes, and a high potency is impossible (see 
exercise 6). So in this case, the maximum-period method should not be used; 
the pure-multiplication method with c = 0 should be applied instead. 

It must be emphasized that high potency is necessary but not sufficient for 
randomness; we use the concept of potency only to reject impotent generators, 
not to accept the potent ones. Linear congruential sequences should pass the 
“spectral test” discussed in Section 3.3.4 before they are considered to be
acceptably random.


EXERCISES 

1. [M10] Show that, no matter what the byte size B of MIX happens to be, the code 
(3) yields a random number generator of maximum period. 

2. [10] What is the potency of the generator represented by the MIX code (3)? 

3. [11] When $m = 2^{35}$, what is the potency of the linear congruential sequence with
a = 3141592621? What is the potency if the multiplier is $a = 2^{23} + 2^{14} + 2^2 + 1$?

4. [15] Show that if $m = 2^e \ge 8$, maximum potency is achieved when a mod 8 = 5.

5. [M20] Given that $m = p_1^{e_1} \cdots p_t^{e_t}$ and $a = 1 + k p_1^{f_1} \cdots p_t^{f_t}$, where a satisfies the
conditions of Theorem 3.2.1.2A and k is relatively prime to m, show that the potency
is $\max(\lceil e_1/f_1 \rceil, \ldots, \lceil e_t/f_t \rceil)$.

► 6. [20] Which of the values of m = w ± 1 in Table 3.2.1.1-1 can be used in a linear 
congruential sequence of maximum period whose potency is 4 or more? (Use the result 
of exercise 5.) 

7. [M20] When a satisfies the conditions of Theorem 3.2.1.2A, it is relatively prime
to m; hence there is a number a' such that $aa' \equiv 1$ (modulo m). Show that a' can be
expressed simply in terms of b.

► 8. [M26] A random number generator defined by $X_{n+1} = (2^{17} + 3)X_n \bmod 2^{35}$ and
$X_0 = 1$ was subjected to the following test: Let $Y_n = \lfloor 10X_n/2^{35} \rfloor$; then $Y_n$ should be
a random digit between 0 and 9, and the triples $(Y_{3n}, Y_{3n+1}, Y_{3n+2})$ should take on
each of the 1000 possible values from (0, 0, 0) to (9, 9, 9) with equal probability. But
with 30000 values of n tested, some triples hardly ever occurred, and others occurred
much more often than they should have. Can you account for this failure?


3.2.2. Other Methods 

Of course, linear congruential sequences are not the only sources of random
numbers that have been proposed for computer use. In this section we shall review
the most significant alternatives; some of these methods are quite important, 
while others are interesting chiefly because they are not as good as a person 
might expect. 

One of the common fallacies encountered in connection with random number 
generation is the idea that we can take a good generator and modify it a little, in 
order to get an “even-more-random” sequence. This is often false. For example, 
we know that 

    $X_{n+1} = (aX_n + c) \bmod m$    (1)

leads to reasonably good random numbers; wouldn’t the sequence produced by

    $X_{n+1} = ((aX_n) \bmod (m + 1) + c) \bmod m$    (2)

be even more random? The answer is, the new sequence is probably a great deal
less random. For the whole theory breaks down, and in the absence of any theory
about the behavior of the sequence (2), we come into the area of generators of
the type $X_{n+1} = f(X_n)$ with the function f chosen at random; exercises 3.1-11
through 3.1-15 show that these sequences probably behave much more poorly 
than the sequences obtained from the more disciplined function (1). 

Let us consider another approach, in an attempt to get “more random” 
numbers. The linear congruential method can be generalized to, say, a quadratic 
congruential method: 


    $X_{n+1} = (dX_n^2 + aX_n + c) \bmod m$.    (3)

Exercise 8 generalizes Theorem 3.2.1.2A to obtain necessary and sufficient
conditions on a, c, and d such that the sequence defined by (3) has a period of the
maximum length m; the restrictions are not much more severe than in the linear 
method. 

An interesting quadratic method has been proposed by R. R. Coveyou when
m is a power of two; let

    $X_0 \bmod 4 = 2$,    $X_{n+1} = X_n(X_n + 1) \bmod 2^e$,    $n \ge 0$.    (4)

This sequence can be computed with about the same efficiency as (1), without any
worries of overflow. It has an interesting connection with von Neumann’s original
middle-square method: If we let $Y_n$ be $2^e X_n$, so that $Y_n$ is a double-precision
number obtained by placing e zeros to the right of the binary representation of
$X_n$, then $Y_{n+1}$ consists of precisely the middle 2e digits of $Y_n^2 + 2^e Y_n$! In other
words, Coveyou’s method is almost identical to a somewhat degenerate
double-precision middle-square method, yet it is guaranteed to have a long period;
further evidence of its randomness is proved in exercise 3.3.4-25.
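
Both remarks about (4) can be checked experimentally for small word sizes. The sketch below is our own; the function names and the choice $X_0 = 2$ are illustrative assumptions. The first function measures the period by iteration, and the second extracts the middle 2e bits of the 4e-bit quantity $Y_n^2 + 2^e Y_n$.

```python
def coveyou_step(x, e):
    """One step of X_{n+1} = X_n (X_n + 1) mod 2^e."""
    return x * (x + 1) % (1 << e)


def coveyou_period(e, x0=2):
    """Number of steps until the sequence started at x0 (x0 mod 4 == 2) returns to x0."""
    x, n = coveyou_step(x0, e), 1
    while x != x0:
        x = coveyou_step(x, e)
        n += 1
    return n


def middle_square_step(y, e):
    """Middle 2e bits of the 4e-bit number y^2 + 2^e * y."""
    full = y * y + (y << e)
    return (full >> e) & ((1 << (2 * e)) - 1)
```

In these small experiments the period comes out as $2^{e-2}$, so the sequence visits every residue congruent to 2 modulo 4, and `middle_square_step(x << e, e)` equals `coveyou_step(x, e) << e`.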

Other generalizations of Eq. (1) also suggest themselves; for example, we
might try to extend the period length of the sequence. The period of a linear
congruential sequence is extremely long; when m is approximately the word size
of the computer, we usually get periods on the order of $10^9$ or more, so that
typical calculations will use only a very small portion of the sequence. On the
other hand, when we discuss the idea of “accuracy” in Section 3.3.4 we will
see that the period length influences the degree of randomness achievable in
a sequence. Therefore it is occasionally desirable to seek a longer period, and
several methods are available for this purpose. One technique is to make $X_{n+1}$
depend on both $X_n$ and $X_{n-1}$, instead of just on $X_n$; then the period length
can be as high as $m^2$, since the sequence will not begin to repeat until we have
$(X_{n+\lambda}, X_{n+\lambda+1}) = (X_n, X_{n+1})$.

The simplest sequence in which $X_{n+1}$ depends on more than one of the
preceding values is the Fibonacci sequence,

    $X_{n+1} = (X_n + X_{n-1}) \bmod m$.    (5)

This generator was considered in the early 1950s, and it usually gives a period
length greater than m; but tests have shown that the numbers produced by the
Fibonacci recurrence (5) are definitely not satisfactorily random, and so at the
present time the main interest in (5) as a source of random numbers is that it
makes a nice “bad example.” We may also consider generators of the form

    $X_{n+1} = (X_n + X_{n-k}) \bmod m$,    (6)

when k is a comparatively large value. These were introduced by Green, Smith,
and Klem [JACM 6 (1959), 527-537], who reported that, when $k \le 15$, the
sequence fails to pass the “gap test” described in Section 3.3.2, although when
k = 16 the test was satisfactory.
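
The remark that the Fibonacci recurrence (5) usually has a period longer than m can be illustrated by iterating on the pair of consecutive values. The sketch below is ours; it assumes the starting values $X_0 = 0$, $X_1 = 1$ and a modulus m ≥ 2.

```python
def fibonacci_period(m):
    """Period of X_{n+1} = (X_n + X_{n-1}) mod m started from (X_0, X_1) = (0, 1)."""
    start = (0, 1)
    state = (1, 1)          # the pair (X_1, X_2)
    n = 1
    while state != start:
        state = (state[1], (state[0] + state[1]) % m)
        n += 1
    return n
```

For example, the period is 60 when m = 10 and 1536 when m = 2^10; both exceed m but fall far short of the bound $m^2$.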

A much better type of additive generator was devised in 1958 by G. J.
Mitchell and D. P. Moore [unpublished], who suggested the somewhat unusual
sequence defined by

    $X_n = (X_{n-24} + X_{n-55}) \bmod m$,    $n \ge 55$,    (7)

where m is even, and where $X_0, \ldots, X_{54}$ are arbitrary integers not all even. The
constants 24 and 55 in this definition were not chosen at random; they are special
values that happen to have the property that the least significant bits $\langle X_n \bmod 2 \rangle$
will have a period of length $2^{55} - 1$. Therefore the sequence $\langle X_n \rangle$ must have a
period at least this long. Exercise 11, which explains how to calculate the period
length of such sequences, proves that (7) has a period of length $2^f(2^{55} - 1)$ for
some f, $0 \le f < e$, when $m = 2^e$.

At first glance Eq. (7) may not seem to be extremely well suited to machine 
implementation, but in fact there is a very efficient way to generate the sequence 
using a cyclic list: 

Algorithm A (Additive number generator). Memory cells Y[1], Y[2], ..., Y[55]
are initially set to the values $X_{54}, X_{53}, \ldots, X_0$, respectively; j is initially equal
to 24 and k is 55. Successive performances of this algorithm will produce the
numbers $X_{55}, X_{56}, \ldots$ as output.

A1. [Add.] (If we are about to output $X_n$ at this point, Y[j] now equals $X_{n-24}$
    and Y[k] equals $X_{n-55}$.) Set Y[k] ← (Y[k] + Y[j]) mod $2^e$, and output Y[k].

A2. [Advance.] Decrease j and k by 1. If now j = 0, set j ← 55; otherwise if
    k = 0, set k ← 55. |
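
Algorithm A carries over directly to a higher-level language. The following sketch is our own transcription (the class name, the default word size e = 32, and the use of a list with an unused cell 0 are illustrative choices); the index arithmetic follows steps A1 and A2 exactly.

```python
class AdditiveGenerator:
    """Algorithm A: X_n = (X_{n-24} + X_{n-55}) mod 2^e, kept in a cyclic list."""

    def __init__(self, seed_values, e=32):
        # seed_values holds X_0, ..., X_54; at least one of them must be odd.
        if len(seed_values) != 55 or all(v % 2 == 0 for v in seed_values):
            raise ValueError("need 55 starting values, not all even")
        self.mask = (1 << e) - 1
        # Y[1..55] = X_54, X_53, ..., X_0; cell 0 is unused to keep 1-based indices.
        self.y = [0] + [v & self.mask for v in reversed(seed_values)]
        self.j = 24
        self.k = 55

    def next(self):
        # A1. [Add.]
        self.y[self.k] = (self.y[self.k] + self.y[self.j]) & self.mask
        output = self.y[self.k]
        # A2. [Advance.]
        self.j -= 1
        self.k -= 1
        if self.j == 0:
            self.j = 55
        elif self.k == 0:
            self.k = 55
        return output
```

Successive calls of `next()` return $X_{55}, X_{56}, \ldots$; the values agree with those obtained by applying recurrence (7) to an ordinary growing list.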

This algorithm in MIX is simply the following: 

Program A ( Additive number generator). Assuming that index registers 5 and 6 
are not touched by the remainder of the program in which this routine is 
embedded, the following code performs Algorithm A and leaves the result in 
register A. rI5 = j, rI6 = k. 


        LDA   Y,6      A1. Add.
        ADD   Y,5      Y_k + Y_j (overflow possible)
        STA   Y,6      → Y_k.
        DEC5  1        A2. Advance. j ← j − 1.
        DEC6  1        k ← k − 1.
        J5P   *+2
        ENT5  55       If j = 0, set j ← 55.
        J6P   *+2
        ENT6  55       If k = 0, set k ← 55. |


This generator is usually faster than the previous methods we have been 
discussing, since it does not require any multiplication. Besides its speed, it has 
the longest period we have seen yet; and it has consistently produced reliable 
results, in extensive tests since its invention in 1958. Furthermore, as Richard 
Brent has observed, it can be made to work correctly with floating point numbers, 
avoiding the need to convert between integers and fractions (cf. exercise 23). 
Therefore it may well prove to be the very best source of random numbers for 
practical purposes. The only reason it is difficult to recommend sequence (7)
wholeheartedly is that there is still very little theory to prove that it does or
does not have desirable randomness properties; essentially all we know for sure
is that the period is very long, and this is not enough. John Reiser (Ph.D.
thesis, Stanford Univ., 1977) has shown, however, that an additive sequence like
(7) will be well distributed in high dimensions, provided that a certain plausible
conjecture is true (cf. exercise 26).

Table 1

SUBSCRIPT PAIRS YIELDING LONG PERIODS MOD 2

(1, 2)    (1, 15)   (5, 23)   (7, 31)   (5, 47)   (21, 52)  (18, 65)  (28, 73)  (2, 93)
(1, 3)    (4, 15)   (9, 23)   (13, 31)  (14, 47)  (24, 55)  (32, 65)  (31, 73)  (21, 94)
(1, 4)    (7, 15)   (3, 25)   (13, 33)  (20, 47)  (7, 57)   (9, 68)   (9, 79)   (11, 95)
(2, 5)    (3, 17)   (7, 25)   (2, 35)   (21, 47)  (22, 57)  (33, 68)  (19, 79)  (17, 95)
(1, 6)    (5, 17)   (3, 28)   (11, 36)  (9, 49)   (19, 58)  (6, 71)   (4, 81)   (6, 97)
(1, 7)    (6, 17)   (9, 28)   (4, 39)   (12, 49)  (1, 60)   (9, 71)   (16, 81)  (12, 97)
(3, 7)    (7, 18)   (13, 28)  (8, 39)   (15, 49)  (11, 60)  (18, 71)  (35, 81)  (33, 97)
(4, 9)    (3, 20)   (2, 29)   (14, 39)  (22, 49)  (1, 63)   (20, 71)  (13, 84)  (34, 97)
(3, 10)   (2, 21)   (3, 31)   (3, 41)   (3, 52)   (5, 63)   (35, 71)  (13, 87)  (11, 98)
(2, 11)   (1, 22)   (6, 31)   (20, 41)  (19, 52)  (31, 63)  (25, 73)  (38, 89)  (27, 98)

For each pair (l, k), the pair (k − l, k) is also valid (see exercise 24), hence only values of
$l \le k/2$ are listed here. For extensions of this table, see N. Zierler and J. Brillhart, Information and
Control 13 (1968), 541-554; 14 (1969), 566-569; 15 (1969), 67-69.

The fact that the special numbers (24, 55) in (7) work so well follows from
theoretical results developed in some of the exercises below. Table 1 lists all
pairs (l, k) for which the sequence $X_n = (X_{n-l} + X_{n-k}) \bmod 2$ has period length
$2^k - 1$, when k < 100. The pairs (l, k) for small as well as larger k are shown, for
the sake of completeness; the pair (1, 2) corresponds to the Fibonacci sequence
mod 2, whose period has length 3. However, only pairs (l, k) for relatively large
k should be used to generate random numbers in practice.
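
For small k the entries of Table 1 can be confirmed by simply running the bit recurrence until the state repeats. The sketch below is our own; it starts from the all-ones state, which is an arbitrary nonzero choice, and it is practical only for small k because the loop may take up to $2^k - 1$ steps.

```python
def bit_period(l, k):
    """Period of X_n = (X_{n-l} + X_{n-k}) mod 2, started from k ones."""
    start = (1,) * k
    state = start                     # state = (X_{n-k}, ..., X_{n-1})
    n = 0
    while True:
        new_bit = (state[-l] + state[0]) % 2   # X_{n-l} + X_{n-k}
        state = state[1:] + (new_bit,)
        n += 1
        if state == start:
            return n


SMALL_TABLE_PAIRS = [(1, 2), (1, 3), (1, 4), (2, 5), (1, 6), (1, 7),
                     (3, 7), (4, 9), (3, 10), (2, 11)]

if __name__ == "__main__":
    for l, k in SMALL_TABLE_PAIRS:
        print((l, k), bit_period(l, k), 2 ** k - 1)
```

Each listed pair attains the period $2^k - 1$, whereas a pair absent from the table, such as (1, 5), does not.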

Instead of considering only additive sequences, we can construct useful
random number generators by taking general linear combinations of $X_{n-1}, \ldots,
X_{n-k}$ for small k. In this case the best results occur when the modulus m is a
large prime; for example, m can be chosen to be the largest prime number that
fits in a single computer word (see Table 4.5.4-1). When m = p is prime, the
theory of finite fields tells us that it is possible to find multipliers $a_1, \ldots, a_k$ such
that the sequence defined by

    $X_n = (a_1 X_{n-1} + \cdots + a_k X_{n-k}) \bmod p$    (8)

has period length $p^k - 1$; here $X_0, \ldots, X_{k-1}$ may be chosen arbitrarily but not
all zero. (The special case k = 1 corresponds to a multiplicative congruential
sequence with prime modulus, with which we are already familiar.) The constants
$a_1, \ldots, a_k$ in (8) have the desired property if and only if the polynomial

    $f(x) = x^k - a_1 x^{k-1} - \cdots - a_k$    (9)

is a “primitive polynomial modulo p,” that is, if and only if this polynomial
has a root that is a primitive element of the field with $p^k$ elements (see exercise
4.6.2-16).

Of course, the mere fact that suitable constants $a_1, \ldots, a_k$ exist giving a
period of length $p^k - 1$ is not enough for practical purposes; we must be able
to find them, and we can’t simply try all $p^k$ possibilities, since p is on the order
of the computer’s word size. Fortunately there are exactly $\varphi(p^k - 1)/k$ suitable
choices of $(a_1, \ldots, a_k)$, so there is a fairly good chance of hitting one after making
a few random tries. But we also need a way to tell quickly whether or not (9)
is a primitive polynomial modulo p; it is certainly unthinkable to generate up to
$p^k - 1$ elements of the sequence and wait for a repetition! Methods of testing
for primitivity modulo p are discussed by Alanen and Knuth in Sankhyā (A) 26
(1964), 305-328; the following criteria can be used: Let $r = (p^k - 1)/(p - 1)$.

i) $(-1)^{k-1} a_k$ must be a primitive root modulo p. (Cf. Section 3.2.1.2.)

ii) The polynomial $x^r$ must be congruent to $(-1)^{k-1} a_k$, modulo f(x) and p.

iii) The degree of $x^{r/q} \bmod f(x)$, using polynomial arithmetic modulo p, must
     be positive, for each prime divisor q of r.

Efficient ways to compute the polynomial $x^n \bmod f(x)$, using polynomial
arithmetic modulo a given prime p, are discussed in Section 4.6.2.
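
For very small p and k the count $\varphi(p^k - 1)/k$ can be confirmed by exactly the exhaustive search that the text rules out for word-sized primes. The sketch below is ours and is meant only as an illustration for tiny parameters; it does not implement criteria (i) through (iii). It starts every sequence from $X_0 = \cdots = X_{k-2} = 0$, $X_{k-1} = 1$, an arbitrary nonzero choice.

```python
from itertools import product
from math import gcd


def euler_phi(n):
    """Euler's totient function, by direct counting (fine for small n)."""
    return sum(1 for i in range(1, n + 1) if gcd(i, n) == 1)


def recurrence_period(coeffs, p):
    """Steps until X_n = (a_1 X_{n-1} + ... + a_k X_{n-k}) mod p returns to its
    starting state, or None if it has not returned within p^k - 1 steps."""
    k = len(coeffs)
    start = (0,) * (k - 1) + (1,)
    state = start                          # state = (X_{n-k}, ..., X_{n-1})
    for n in range(1, p ** k):
        new = sum(a * x for a, x in zip(coeffs, reversed(state))) % p
        state = state[1:] + (new,)
        if state == start:
            return n
    return None


def count_full_period(p, k):
    """Number of coefficient vectors (a_1, ..., a_k) giving period p^k - 1."""
    return sum(1 for coeffs in product(range(p), repeat=k)
               if recurrence_period(coeffs, p) == p ** k - 1)
```

For p = 5 and k = 2, for instance, the search finds 4 suitable pairs $(a_1, a_2)$, in agreement with $\varphi(24)/2$.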

In order to carry out this test, we need to know the prime factorization of
$r = (p^k - 1)/(p - 1)$, and this is the limiting factor in the calculation; r can
be factored in a reasonable amount of time when k = 2, 3, and perhaps 4, but
higher values of k are difficult to handle when p is large. Even k = 2 essentially
doubles the number of “significant random digits” over what is achievable with
k = 1, so larger values of k will rarely be necessary.

An adaptation of the spectral test (Section 3.3.4) can be used to rate the
sequence of numbers generated by (8); see exercise 3.3.4-26. The considerations
of that section show that we should not make the obvious choice of $a_1 = +1$ or
−1 when it is possible to do so; it is better to pick large, essentially “random,”
values of $a_1, \ldots, a_k$ that satisfy the conditions, and to verify the choice by
applying the spectral test. A significant amount of computation is involved in
finding $a_1, \ldots, a_k$, but all known evidence indicates that the result will be a very
satisfactory source of random numbers. We essentially achieve the randomness of
a linear congruential generator with k-tuple precision, using only single precision
operations.

The special case p = 2 is of independent interest. Sometimes a random
number generator is desired that merely produces a random sequence of bits
(zeros and ones) instead of fractions between zero and one. There is a simple way
to generate a highly random bit sequence on a binary computer, manipulating
k-bit words: Start with an arbitrary nonzero binary word X. To get the next
random bit of the sequence, do the following operations, shown in MIX’s language
(see exercise 16):

        LDA   X      (Assume that overflow is now “off.”)
        ADD   X      Shift left one bit.
        JNOV  *+2    Jump if high bit was originally zero.          (10)
        XOR   A      Otherwise adjust number with “exclusive or.”
        STA   X  |

The fourth instruction here is the “exclusive or” operation found on nearly
all binary computers (cf. exercise 2.5-28 and Section 7.1); it changes each bit
position in which location A has a “1” bit. The value in location A is the binary
constant $(a_1 \ldots a_k)_2$, where $x^k - a_1 x^{k-1} - \cdots - a_k$ is a primitive polynomial
modulo 2 as above. After the code (10) has been executed, the next bit of
the generated sequence may be taken as the least significant bit of word X (or,
alternatively, we could consistently use the most significant bit of X, if it were
more convenient to do so).

For example, consider Fig. 1, which illustrates the sequence generated for
k = 4 and CONTENTS(A) = $(0011)_2$. This is, of course, an unusually small
value for k. The right-hand column shows the sequence of bits of the sequence,
namely 1101011110001001..., repeating in a period of length $2^k - 1 = 15$. This
sequence is quite random, considering that it was generated with only four bits
of memory; to see this, consider the adjacent sets of four bits occurring in the
period, namely 1101, 1010, 0101, 1011, 0111, 1111, 1110, 1100, 1000, 0001, 0010,
0100, 1001, 0011, 0110. In general, every possible adjacent set of k bits occurs
exactly once in the period, except the set of all zeros, since the period length is
$2^k - 1$; thus, adjacent sets of k bits are essentially independent. We shall see
in Section 3.5 that this is a very strong criterion for randomness when k is, say,
30 or more. Theoretical results illustrating the randomness of this sequence are
given in an article by R. C. Tausworthe, Math. Comp. 19 (1965), 201-209.
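
The bit generator (10) is short enough to simulate directly. In the sketch below (our own names throughout) the word length k and the constant A are parameters. The starting word 1100 used afterwards is an assumption: it is chosen so that the low-order bits produced after each step reproduce the bit string quoted above, and any other nonzero word would yield a cyclic shift of the same sequence.

```python
def next_word(x, a, k):
    """One execution of code (10): shift the k-bit word left, and XOR with the
    constant a if the bit shifted out was a one."""
    x <<= 1
    if x >> k:                                 # overflow: high bit was originally one
        x = (x & ((1 << k) - 1)) ^ a
    return x


def bit_stream(x, a, k, count):
    """Low-order bits of the word after each of `count` successive steps."""
    bits = []
    for _ in range(count):
        x = next_word(x, a, k)
        bits.append(x & 1)
    return bits


if __name__ == "__main__":
    bits = bit_stream(0b1100, 0b0011, 4, 16)
    print("".join(map(str, bits)))
```

Run with k = 4 and A = $(0011)_2$, the sketch produces a sequence of period 15 in which the 15 cyclic windows of four adjacent bits are all different and none is 0000.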

Primitive polynomials modulo 2 of degree ≤ 100 have been tabulated by
E. J. Watson, Math. Comp. 16 (1962), 368-369. When k = 35, we may take

    CONTENTS(A) = $(00000000000000000000000000000000101)_2$,

but the considerations of exercises 18 and 3.3.4-26 imply that it would be better
to find “random” constants that define primitive polynomials modulo 2.

Caution: Several people have been trapped into believing that this random
bit-generation technique can be used to generate random whole-word fractions
$(.X_0 X_1 \ldots X_{k-1})_2$, $(.X_k X_{k+1} \ldots X_{2k-1})_2$, ...; but it is actually a poor source of
random fractions, even though the bits are individually quite random. Exercise
18 explains why.

Mitchell and Moore’s additive generator (7) is essentially based on the
concept of primitive polynomials; the polynomial $x^{55} + x^{24} + 1$ is primitive, and
Table 1 is essentially a listing of all the primitive trinomials modulo 2. A
generator almost identical to that of Mitchell and Moore was independently
discovered in 1971 by T. G. Lewis and W. H. Payne [cf. JACM 20 (1973), 456-468],
but using “exclusive or” instead of addition so that the period is exactly $2^{55} - 1$;
each bit position in their generated numbers runs through the same periodic
sequence, but has its own starting point. (See Bright and Enison, Computing
Surveys 11 (1979), 357-370, for further discussion of Lewis and Payne’s method.)

Fig. 1. Successive contents of the computer word X in the binary 
method, assuming that k = 4 and CONTENTS(A) = (0011)_2. 

We have now seen that sequences with 0 ≤ X_n < m and period m^k - 1 
can be found, when X_n is a suitable function of X_{n-1}, ..., X_{n-k} and when m 
is prime. The highest conceivable period for any sequence defined by a relation 
of the form 

X_n = f(X_{n-1}, ..., X_{n-k}), 0 ≤ X_n < m, (11) 

is easily seen to be m^k. M. H. Martin [Bull. Amer. Math. Soc. 40 (1934), 859-864] 
was the first person to show that functions achieving this maximum period are 
possible for all m and k; his method is easy to state, but it is unfortunately not 
suitable for programming (see exercise 17). From a computational standpoint, 
the simplest known functions f that yield the maximum period m^k appear in 
exercise 21; the corresponding programs are, in general, not as efficient for 
random number generation as other methods we have described, but they do 
give demonstrable randomness when the period as a whole is considered. 

Another important class of techniques deals with the combination of random 
number generators, to get “more random” sequences. There will always be people 
who feel that the linear congruential methods, additive methods, etc., are all 
too simple to give sufficiently random sequences; and it may never be possible to 
prove that their skepticism is unjustified (although we believe it is), so it is pretty 
useless to argue the point. There are reasonably efficient methods for combining 
two sequences into a third one that should be haphazard enough to satisfy all 
but the most hardened skeptic. 

Suppose we have two sequences X_0, X_1, ... and Y_0, Y_1, ... of random numbers 
between 0 and m - 1, preferably generated by two unrelated methods. One 
suggestion has been to add them together, mod m, obtaining the sequence Z_n = 
(X_n + Y_n) mod m; in this case, the period will be quite long if the period lengths 
of (X_n) and (Y_n) are relatively prime to each other (see exercise 13). Another 
approach, based on circular shifting and “exclusive or-ing”, has been suggested 
by W. J. Westlake, JACM 14 (1967), 337-340. 
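
The effect of relatively prime period lengths on the sum Z_n = (X_n + Y_n) mod m can be seen on a toy case. The Python sketch below uses two artificial periodic sequences (period lengths 3 and 4, modulus 5) chosen only for illustration; they are not generators anyone would use, and the helper least_period is a hypothetical name introduced here.

```python
from itertools import cycle, islice


def least_period(seq):
    """Smallest p > 0 with seq[i] == seq[i + p] for every valid i."""
    for p in range(1, len(seq)):
        if all(seq[i] == seq[i + p] for i in range(len(seq) - p)):
            return p
    return len(seq)


m = 5
xs = cycle([0, 1, 2])        # toy sequence with period length 3
ys = cycle([0, 1, 2, 3])     # toy sequence with period length 4
zs = [(x + y) % m for x, y in islice(zip(xs, ys), 48)]

print(least_period(zs))
```

For this particular pair the combined sequence repeats only after 3 · 4 = 12 terms; exercise 13 treats the general statement.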

[Fig. 1, successive contents of the computer word X, read in order: 1011, 0101, 1010, 0111, 1110, 1111, 1101, 1001, 0001, 0010, 0100, 1000, 0011, 0110, 1100, 1011.] 

A considerably different method has been suggested by M. D. MacLaren 
and G. Marsaglia [JACM 12 (1965), 83-89; CACM 11 (1968), 759], who use one 
random sequence to permute the elements of another: 

Algorithm M (Randomizing by shuffling). Given methods for generating two 
sequences (X_n) and (Y_n), this algorithm will successively output the terms of a 
“considerably more random” sequence. We use an auxiliary table V[0], V[1], 
..., V[k-1], where k is some number chosen for convenience, usually in the 
neighborhood of 100. Initially, the V-table is filled with the first k values of the 
X-sequence. 

M1. [Generate X, Y.] Set X and Y equal to the next members of the sequences 
(X_n) and (Y_n), respectively. 

M2. [Extract j.] Set j ← ⌊kY/m⌋, where m is the modulus used in the sequence 
(Y_n); that is, j is a random value, 0 ≤ j < k, determined by Y. 

M3. [Exchange.] Output V[j] and then set V[j] ← X. | 

As an example, assume that Algorithm M is applied to the following two 
sequences, with k = 64: 

X_0 = 5772156649, X_{n+1} = (3141592653 X_n + 2718281829) mod 2^35; 

Y_0 = 1781072418, Y_{n+1} = (2718281829 Y_n + 3141592653) mod 2^35. 
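
A minimal Python sketch of Algorithm M applied to these two sequences follows. It assumes that step M1 first reads the X-value that comes right after the k table entries and starts the Y-sequence at Y_0; the names lcg and algorithm_m are illustrative, not taken from the text.

```python
def lcg(x, a, c, m):
    """Yield x_0, x_1, ... where x_{n+1} = (a*x_n + c) mod m."""
    while True:
        yield x
        x = (a * x + c) % m


def algorithm_m(xs, ys, m, k):
    """Shuffle the iterator xs under control of ys (steps M1 to M3)."""
    v = [next(xs) for _ in range(k)]   # table holds the first k X-values
    while True:
        x, y = next(xs), next(ys)      # M1: next members of both sequences
        j = k * y // m                 # M2: j = floor(k*Y/m), 0 <= j < k
        out, v[j] = v[j], x            # M3: output V[j], then V[j] <- X
        yield out


M = 2**35
xs = lcg(5772156649, 3141592653, 2718281829, M)
ys = lcg(1781072418, 2718281829, 3141592653, M)
shuffled = algorithm_m(xs, ys, M, k=64)
sample = [next(shuffled) for _ in range(5)]
print(sample)
```

Integer floor division is used for ⌊kY/m⌋ so that no floating point rounding can push j outside the table.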

On intuitive grounds it appears safe to predict that the sequence obtained by 
applying Algorithm M will satisfy virtually anyone’s requirements for randomness 
in a computer-generated sequence, because the relationship between nearby 
terms of the output has been almost entirely obliterated. Furthermore, the time 
required to generate this sequence is only slightly more than twice as long as it 
takes to generate the sequence (X_n) alone. 

Exercise 15 proves that the period length of Algorithm M’s output will be the 
least common multiple of the period lengths of (X_n) and (Y_n), in most situations 
of practical interest. In particular, the above example will have a period of 
length 2^35. 

However, there is an even better way to shuffle the elements of a sequence, 
discovered by Carter Bays and S. D. Durham [ACM Trans. Math. Software 2 
(1976), 59-64]. Their approach, although it appears to be superficially similar to 
Algorithm M, can give surprisingly better performance even though it requires 
only one input sequence (X n ) instead of two: 

Algorithm B (Randomizing by shuffling). Given a method for generating a sequence 
(X_n), this algorithm will successively output the terms of a “considerably 
more random” sequence, using an auxiliary table V[0], V[1], ..., V[k-1] as 
in Algorithm M. Initially the V-table is filled with the first k values of the 
X-sequence, and an auxiliary variable Y is set equal to the (k+1)st value. 

B1. [Extract j.] Set j ← ⌊kY/m⌋, where m is the modulus used in the sequence 
(X_n); that is, j is a random value, 0 ≤ j < k, determined by Y. 

B2. [Exchange.] Set Y ← V[j], output Y, and then set V[j] to the next member 
of the sequence (X_n). | 

The reader is urged to work exercise 3, in order to get a feeling for the 
difference between Algorithms M and B. 
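
For comparison with Algorithm M, here is a minimal Python sketch of Algorithm B. The generator lcg and the function name algorithm_b are illustrative; the constants in the usage lines are the X-sequence of the earlier example with k = 64, which is an arbitrary choice made here.

```python
def lcg(x, a, c, m):
    """Yield x_0, x_1, ... where x_{n+1} = (a*x_n + c) mod m."""
    while True:
        yield x
        x = (a * x + c) % m


def algorithm_b(xs, m, k):
    """Shuffle the single iterator xs (steps B1 and B2)."""
    v = [next(xs) for _ in range(k)]   # first k values of the X-sequence
    y = next(xs)                       # auxiliary Y: the (k+1)st value
    while True:
        j = k * y // m                 # B1: j = floor(k*Y/m), 0 <= j < k
        y = v[j]                       # B2: Y <- V[j],
        yield y                        #     output Y,
        v[j] = next(xs)                #     then refill V[j] from (X_n)


M = 2**35
shuffled = algorithm_b(lcg(5772156649, 3141592653, 2718281829, M), M, k=64)
sample = [next(shuffled) for _ in range(5)]
print(sample)
```

The only state carried between outputs is the table V and the variable Y, so the previous output itself selects the next table slot.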

On MIX we may implement Algorithm B by taking k equal to the byte size, 
obtaining the following simple generation scheme once the initialization has been 
done: 


    LD6   Y(1:1)     j ← high-order byte of Y. 
    LDA   X          rA ← X_n. 
    INCA  1          (cf. exercise 3.2.1.1-1) 
    MUL   A          rX ← X_{n+1}. 
    STX   X          “n ← n + 1.” 
    LDA   V,6 
    STA   Y          Y ← V[j]. 
    STX   V,6        V[j] ← X_n.  |    (13) 


The output appears in register A. Note that Algorithm B requires only four 
instructions of overhead per generated number. 

F. Gebhardt [Math. Comp. 21 (1967), 708-709] found that satisfactory 
random sequences were produced by Algorithm M even when it was applied to 
a sequence as nonrandom as the Fibonacci sequence, with X_n = F_{2n} mod m 
and Y_n = F_{2n+1} mod m. However, it is also possible for Algorithm M to 
produce a sequence less random than the original sequences, if (X_n) and (Y_n) 
are strongly related, as shown in exercise 3. Such problems do not seem to 
arise with Algorithm B. Since Algorithm B won’t make a sequence any less 
random, and since it probably enhances the randomness substantially with very 
little extra cost, it can be recommended for use in combination with any other 
random number generator. 

EXERCISES 

► 1. [12] In practice, we form random numbers using X_{n+1} = (aX_n + c) mod m, 
where the X’s are integers, afterwards treating them as the fractions U_n = X_n/m. 
The recurrence relation for U_n is actually 

U_{n+1} = (aU_n + c/m) mod 1. 

Discuss the generation of random sequences using this relation directly, by making use 
of floating point arithmetic on the computer. 

► 2. [M20] A good source of random numbers will have X_{n-1} < X_{n+1} < X_n about 
one-sixth of the time, since each of the six possible relative orders of X_{n-1}, X_n, and 
X_{n+1} should be equally probable. However, show that the above ordering never occurs 
if the Fibonacci sequence (5) is used. 

3. [23] (a) What sequence comes from Algorithm M if 

X_0 = 0, X_{n+1} = (5X_n + 3) mod 8, Y_0 = 0, Y_{n+1} = (5Y_n + 1) mod 8, 

and k = 4? (Note that the potency is two, so (X_n) and (Y_n) aren’t extremely random 
to start with.) (b) What happens if Algorithm B is applied to this same sequence (X_n) 
with k = 4? 

4. [00] Why is the most significant byte used in the first line of program (13), instead 
of some other byte? 

► 5. [20] Discuss using X n = Y n in Algorithm M, in order to improve the speed of 
generation. Is the result analogous to Algorithm B? 

6. [10] In the binary method (10), the text states that the low-order bit of X is 
random, if the code is performed repeatedly. Why isn’t the entire word X random? 

7. [20] Show that the full sequence of length 2^e (i.e., a sequence in which each of 
the 2^e possible sets of e adjacent bits occurs just once in the period) may be obtained 
if program (10) is changed to the following: 

LDA X 
JANZ *+2 
LDA A 
ADD X 
JNOV *+3 
JAZ *+2 
XOR A 

STA X | 

8. [M39] Prove that the quadratic congruential sequence (3) has period length m if 
and only if the following conditions are satisfied: 

i) c is relatively prime to m; 

ii) d and a - 1 are both multiples of p, for all odd primes p dividing m; 

iii) d is even, and d ≡ a - 1 (modulo 4), if m is a multiple of 4; 
d ≡ a - 1 (modulo 2), if m is a multiple of 2; 

iv) either d ≡ 0 or both a ≡ 1 and cd ≡ 6 (modulo 9), if m is a multiple of 9. 

[Hint: The sequence defined by X_0 = 0, X_{n+1} = dX_n^2 + aX_n + c has a period of 
length m, modulo m, only if its period length is r modulo any divisor r of m.] 

► 9. [M24] (R. R. Coveyou.) Use the result of exercise 8 to prove that the modified 
middle-square method (4) has a period of length 2^{e-2}. 

10. [M29] Show that if X_0 and X_1 are not both even and if m = 2^e, the period of 
the Fibonacci sequence (5) is 3 · 2^{e-1}. 

11. [M36] The purpose of this exercise is to analyze certain properties of integer 
sequences satisfying the recurrence relation 

X_n = a_1 X_{n-1} + ... + a_k X_{n-k}, n ≥ k; 

if we can calculate the period length of this sequence modulo m = p^e, when p is prime, 
the period length with respect to an arbitrary modulus m is the least common multiple 
of the period lengths for the prime power factors of m. 

a) If f(z), a(z), b(z) are polynomials with integer coefficients, let us write a(z) ≡ b(z) 
(modulo f(z) and m) if a(z) = b(z) + f(z)u(z) + mv(z) for some polynomials 
u(z), v(z) with integer coefficients. Prove that when f(0) = 1 and p^e > 2, “If 
z^λ ≡ 1 (modulo f(z) and p^e), z^λ ≢ 1 (modulo f(z) and p^{e+1}), then z^{pλ} ≡ 1 
(modulo f(z) and p^{e+1}), z^{pλ} ≢ 1 (modulo f(z) and p^{e+2}).” 

b) Let f(z) = 1 - a_1 z - ... - a_k z^k, and let 

G(z) = 1/f(z) = A_0 + A_1 z + A_2 z^2 + ... . 

Let λ(m) denote the period length of (A_n mod m). Prove that λ(m) is the smallest 
positive integer λ such that z^λ ≡ 1 (modulo f(z) and m). 

c) Given that p is prime, p^e > 2, and λ(p^e) ≠ λ(p^{e+1}), prove that λ(p^{e+r}) = p^r λ(p^e) 
for all r ≥ 0. (Thus, to find the period length of the sequence (A_n mod 2^e), we 
can compute λ(4), λ(8), λ(16), ... until we find the smallest e ≥ 3 such that 
λ(2^e) ≠ λ(4); then the period length is determined mod 2^e for all e. Exercise 
4.6.3-26 explains how to calculate X_n for large n in O(log n) operations.) 

d) Show that any sequence of integers satisfying the recurrence stated at the beginning 
of this exercise has the generating function g(z)/f(z), for some polynomial g(z) 
with integer coefficients. 

e) Given that the polynomials f(z) and g(z) in part (d) are relatively prime modulo 
p (cf. Section 4.6.1), prove that the sequence (X_n mod p^e) has exactly the same 
period length as the special sequence (A_n mod p^e) in (b). (No longer period could 
be obtained by any choice of X_0, ..., X_{k-1}, since the general sequence is a linear 
combination of “shifts” of the special sequence.) [Hint: By exercise 4.6.2-22 
(Hensel’s lemma), there exist polynomials such that a(z)f(z) + b(z)g(z) ≡ 1 
(modulo p^e).] 

► 12. [M28] Find integers X_0, X_1, a, b, and c such that the sequence 

X_{n+1} = (aX_n + bX_{n-1} + c) mod 2^e, n ≥ 1, 

has the longest period length of all sequences of this type. [Hint: It follows that 
X_{n+2} = ((a + 1)X_{n+1} + (b - a)X_n - bX_{n-1}) mod 2^e; see exercise 11(c).] 

13. [M20] Let (X_n) and (Y_n) be sequences of integers mod m with periods of lengths 
λ_1 and λ_2, and form the sequence Z_n = (X_n + Y_n) mod m. Show that if λ_1 and λ_2 are 
relatively prime, the sequence (Z_n) has a period of length λ_1 λ_2. 

14. [M24] Let X_n, Y_n, Z_n, λ_1, λ_2 be as in the previous exercise. Suppose that the 
prime factorization of λ_1 is 2^{e_2} 3^{e_3} 5^{e_5} ..., and similarly suppose that λ_2 = 2^{f_2} 3^{f_3} 5^{f_5} ... . 
Let g_p = (max(e_p, f_p) if e_p ≠ f_p, otherwise 0), and let λ_0 = 2^{g_2} 3^{g_3} 5^{g_5} ... . Show that 
the period length λ' of the sequence (Z_n) is a multiple of λ_0, but it is a divisor of 
λ = lcm(λ_1, λ_2). In particular, λ' = λ if (e_p ≠ f_p or e_p = f_p = 0) for each prime p. 

15. [M27] Let the sequence (X_n) in Algorithm M have period length λ_1, and assume 
that all elements of its period are distinct. Let q_n = min{ r | r > 0 and ⌊kY_{n-r}/m⌋ = 
⌊kY_n/m⌋ }. Assume that q_n < (1/2)λ_1 for all n ≥ n_0, and that the sequence (q_n) has 
period length λ_2. Let λ be the least common multiple of λ_1 and λ_2. Prove that the 
output sequence (Z_n) produced by Algorithm M has a period of length λ. 

► 16. [M28] Let CONTENTS(A) in method (10) be (a_1 a_2 ... a_k)_2 in binary notation. Show 
that the generated sequence of low-order bits X_0, X_1, ... satisfies the relation 

X_n = (a_1 X_{n-1} + a_2 X_{n-2} + ... + a_k X_{n-k}) mod 2. 

[This may be regarded as another way to define the sequence, although the connection 
between this relation and the efficient code (10) is not apparent at first glance!] 

17. [M33] (M. H. Martin, 1934.) Let m and k be positive integers, and let X_1 = 
X_2 = ... = X_k = 0. For all n > 0, set X_{n+k} equal to the largest nonnegative value 
y < m such that the k-tuple (X_{n+1}, ..., X_{n+k-1}, y) has not already occurred in the 
sequence; in other words, (X_{n+1}, ..., X_{n+k-1}, y) must differ from (X_{r+1}, ..., X_{r+k}) 
for 0 ≤ r < n. In this way, each possible k-tuple will occur at most once in the 
sequence. Eventually the process will terminate, when we reach a value of n such 
that (X_{n+1}, ..., X_{n+k-1}, y) has already occurred in the sequence for all nonnegative 
y < m. For example, if m = k = 3 the sequence is 00022212202112102012001110100, 
and the process terminates at this point. (a) Prove that when the sequence terminates, 
we have X_{n+1} = ... = X_{n+k-1} = 0. (b) Prove that every k-tuple (a_1, a_2, ..., a_k) of 
elements with 0 ≤ a_j < m occurs in the sequence; hence the sequence terminates when 
n = m^k. [Hint: Prove that the k-tuple (a_1, ..., a_s, 0, ..., 0) appears, when a_s ≠ 0, 
by induction on s.] Note that if we now define f(X_n, ..., X_{n+k-1}) = X_{n+k} for 
1 ≤ n ≤ m^k, setting X_{m^k+k} = 0, we obtain a function of maximum possible period. 

18. [M22] Let (X_n) be the sequence of bits generated by method (10), with k = 35 
and CONTENTS(A) = (00000000000000000000000000000000101)_2. Let U_n be the binary 
fraction (.X_{nk} X_{nk+1} ... X_{nk+k-1})_2; show that this sequence (U_n) fails the serial test 
on pairs (Section 3.3.2B) when d = 8. 

19. [M41] For each prime p specified in the first column of Table 1 in Section 4.5.4, 
find suitable constants a_1, a_2 as suggested in the text, such that the period length of 
(8), when k = 2, is p^2 - 1. (See Eq. 3.3.4-39 for an example.) 

20. [M40] Calculate constants suitable for use as CONTENTS(A) in method (10), having 
approximately the same number of zeros as ones, for 2 ≤ k ≤ 64. 

21. [M35] (D. Rees.) The text explains how to find functions f such that the sequence 
(11) has period length m^k - 1, provided that m is prime and X_0, ..., X_{k-1} are not all 
zero. Show that such functions can be modified to obtain sequences of type (11) with 
period length m^k, for all integers m. [Hints: Consider Lemma 3.2.1.2Q, the trick of 
exercise 7, and sequences such as (pX_{2n} + X_{2n+1}).] 

► 22. [M24] The text restricts discussion of the extended linear sequences (8) to the 
case that m is prime. Prove that reasonably long periods can also be obtained when m 
is “square-free,” i.e., the product of distinct primes. (Examination of Table 3.2.1.1-1 
shows that m = w ± 1 often satisfies this hypothesis; many of the results of the text 
can therefore be carried over to that case, which is somewhat more convenient for 
calculation.) 

► 23. [20] Discuss the sequence defined by X_n = (X_{n-55} - X_{n-24}) mod m as an 
alternative to (7). 

24. [M20] Let 0 < k < m. Prove that the sequence of bits defined by the recurrence 
X_n = (X_{n-m+k} + X_{n-m}) mod 2 has period length 2^m - 1 whenever the sequence 
defined by Y_n = (Y_{n-k} + Y_{n-m}) mod 2 does. 

25. [26] Discuss the alternative to Program A that changes all 55 entries of the Y 
table every 55th time a random number is required. 

26. [M48] (J. F. Reiser.) Let p be prime and let k be a positive integer. Given integers 
a_1, ..., a_k and x_1, ..., x_k, let λ_α be the period of the sequence (X_n) generated by the 
recurrence 

X_n = x_n mod p^α, 0 < n ≤ k; X_n = (a_1 X_{n-1} + ... + a_k X_{n-k}) mod p^α, n > k; 

and let N_α be the number of 0’s that occur in the period (i.e., the number of indices j 
such that μ_α ≤ j < μ_α + λ_α and X_j = 0). Prove or disprove the following conjecture: 
There exists a constant c (depending possibly on p and k and a_1, ..., a_k) such that 
N_α ≤ c p^{α(k-2)/(k-1)} for all α and all x_1, ..., x_k. 

[Notes: Reiser has proved that if the recurrence has maximum period length mod p 
(i.e., if λ_1 = p^k - 1), and if the conjecture holds, then the k-dimensional discrepancy 
of (X_n) will be O(α^k p^{-α/(k-1)}) as α → ∞; thus an additive generator like (7) would 
be well distributed in 55 dimensions, when m = 2^e and the entire period is considered. 
(See Section 3.3.4 for the definition of discrepancy in k dimensions.) The conjecture 
is a very weak condition, for if (X_n) takes on each value about equally often and if 
λ_α = p^{α-1}(p^k - 1), the quantity N_α ≈ (p^k - 1)/p does not grow at all as α increases. 
Reiser has verified the conjecture for k = 3. On the other hand he has shown that 
it is possible to find unusually bad starting values x_1, ..., x_k (depending on α) so that 
N_{2α} ≥ p^α, provided that λ_α = p^{α-1}(p^k - 1) and k ≥ 3 and α is sufficiently large.] 

27. [M30] Suppose Algorithm B is being applied to a sequence (X_n) whose period 
length is λ, where λ > k. Show that for fixed k and all sufficiently large λ, the output 
of the sequence will eventually be periodic with the same period length λ, unless (X_n) 
isn’t very random to start with. [Hint: Find a pattern of consecutive values of ⌊kX_n/m⌋ 
that causes Algorithm B to “synchronize” its subsequent behavior.] 

28. [40] (A. G. Waterman.) Experiment with linear congruential sequences with m the 
square or cube of the computer word size, while a and c are single-precision numbers. 

3.3. STATISTICAL TESTS 


OUR MAIN PURPOSE is to obtain sequences that behave as if they are random. 
So far we have seen how to make the period of a sequence so long that for 
practical purposes it never will repeat; this is an important criterion, but it by 
no means guarantees that the sequence will be useful in applications. How then 
are we to decide whether a sequence is sufficiently random? 

If we were to give some man a pencil and paper and ask him to write down 
100 random decimal digits, chances are very slim that he will give a satisfactory 
result. People tend to avoid things that seem nonrandom, such as pairs of 
equal adjacent digits (although about one out of every 10 digits should equal 
its predecessor). And if we would show someone a table of truly random digits, 
he would quite probably tell us they are not random at all; his eye would spot 
certain apparent regularities. 

According to Dr. I. J. Matrix (as quoted by Martin Gardner in Scientific 
American, January, 1965), “Mathematicians consider the decimal expansion of 
π a random series, but to a modern numerologist it is rich with remarkable 
patterns.” Dr. Matrix has pointed out, for example, that the first repeated 
two-digit number in π’s expansion is 26, and its second appearance comes in the 
middle of a curious repetition pattern: 

3.14159265358979323846264338327950 

After listing a dozen or so further properties of these digits, he observed that π, 
when correctly interpreted, conveys the entire history of the human race! 

We all notice patterns in our telephone numbers, license numbers, etc., as 
aids to memory. The point of these remarks is that we cannot be trusted to judge 
by ourselves whether a sequence of numbers is random or not. Some unbiased 
mechanical tests must be applied. 

The theory of statistics provides us with some quantitative measures for 
randomness. There is literally no end to the number of tests that can be 
conceived; we will discuss those tests that have proved to be most useful, most 
instructive, and most readily adapted to computer calculation. 

If a sequence behaves randomly with respect to tests T_1, T_2, ..., T_n, we 
cannot be sure in general that it will not be a miserable failure when it is 
subjected to a further test T_{n+1}; yet each test gives us more and more confidence 
in the randomness of the sequence. In practice, we apply about half a dozen 
different kinds of statistical tests to a sequence, and if it passes these satisfactorily 
we consider it to be random—it is then presumed innocent until proven guilty. 

Every sequence that is to be used extensively should be tested carefully, 
so the following sections explain how to carry out these tests in the right way. 
Two kinds of tests are distinguished: empirical tests, for which the computer 
manipulates groups of numbers of the sequence and evaluates certain statistics; 
and theoretical tests, for which we establish characteristics of the sequence by 
using number-theoretic methods based on the recurrence rule used to form the 
sequence. 

If the evidence doesn’t come out as desired, the reader may wish to try the 
techniques in How to Lie With Statistics by Darrell Huff (Norton, 1954). 


3.3.1. General Test Procedures for Studying Random Data 

A. “Chi-square” tests. The chi-square test (χ^2 test) is perhaps the best known 
of all statistical tests, and it is a basic method that is used in connection with 
many other tests. Before considering the method in general, let us consider a 
particular example of the chi-square test as it might be applied to dice throwing. 
Using two “true” dice (each of which, independently, is assumed to register 1, 2, 
3, 4, 5, or 6 with equal probability), the following table gives the probability of 
obtaining a given total, s, on a single throw: 

value of s =        2     3     4     5     6     7     8     9     10    11    12 
probability, p_s =  1/36  1/18  1/12  1/9   5/36  1/6   5/36  1/9   1/12  1/18  1/36    (1) 

(For example, a value of 4 can be thrown in three ways: 1 + 3, 2 + 2, 3 + 1; this 
constitutes 3/36 = 1/12 = p_4 of the 36 possible outcomes.) 

If we throw the dice n times, we should obtain the value s approximately 
np_s times on the average. For example, in 144 throws we should get the value 4 
about 12 times. The following table shows what results were actually obtained 
in a particular sequence of 144 throws of the dice: 

value of s =               2   3   4   5   6   7   8   9  10  11  12 
observed number, Y_s =     2   4  10  12  22  29  21  15  14   9   6    (2) 
expected number, np_s =    4   8  12  16  20  24  20  16  12   8   4 
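
The probabilities p_s and the expected counts np_s can be reproduced by brute-force enumeration of the 36 outcomes. The short Python sketch below does this with exact rational arithmetic; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction

# Count the ways each total s can arise from two fair dice.
ways = Counter(a + b for a in range(1, 7) for b in range(1, 7))
p = {s: Fraction(ways[s], 36) for s in range(2, 13)}

n = 144
expected = [n * p[s] for s in range(2, 13)]

print([str(q) for q in p.values()])
print([int(e) for e in expected])
```

With n = 144 every expected count is an integer, which is why the table's last row contains no fractions.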

Note that the observed number is different from the expected number in all cases; 
in fact, random throws of the dice will hardly ever come out with exactly the 
right frequencies. There are 36^144 possible sequences of 144 throws, all of which 
are equally likely. One of these sequences consists of all 2’s (“snake eyes”), and 
anyone throwing 144 snake eyes in a row would be convinced that the dice were 
loaded. Yet the sequence of all 2’s is just as probable as any other particular 
sequence if we specify the outcome of each throw of each die. 

In view of this, how can we test whether or not a given pair of dice is loaded? 
The answer is that we can’t make a definite yes-no statement, but we can give 
a probabilistic answer. We can say how probable or improbable certain types of 
events are. 

A fairly natural way to proceed in the above example is to consider the 
squares of the differences between the observed numbers Y_s and the expected 
numbers np_s. We can add these together, obtaining 

V = (Y_2 - np_2)^2 + (Y_3 - np_3)^2 + ... + (Y_12 - np_12)^2. (3) 

A bad set of dice should result in a relatively high value of V; and for any given 
value of V we can ask, “What is the probability that V is this high, using true 
dice?” If this probability is very small, say 1/100, we would know that only about 
one time in 100 would true dice give results so far away from the expected numbers, 
and we would have definite grounds for suspicion. (Remember, however, 
that even good dice would give such a high value of V about one time in a 
hundred, so a cautious person would repeat the experiment to see if the high 
value of V is repeated.) 

The statistic V in (3) gives equal weight to (Y_7 - np_7)^2 and (Y_2 - np_2)^2, 
although (Y_7 - np_7)^2 is likely to be a good deal higher than (Y_2 - np_2)^2 since 7’s 
occur about six times as often as 2’s. It turns out that the “right” statistic, at 
least one that has proved to be most important, will give (Y_7 - np_7)^2 only 1/6 as 
much weight as (Y_2 - np_2)^2, and we should change (3) to the following formula: 

V = (Y_2 - np_2)^2/(np_2) + (Y_3 - np_3)^2/(np_3) + ... + (Y_12 - np_12)^2/(np_12). (4) 

This is called the “chi-square” statistic of the observed quantities Y_2, ..., Y_12 in 
this dice-throwing experiment. For the data in (2), we find that 

V = (2 - 4)^2/4 + (4 - 8)^2/8 + ... + (9 - 8)^2/8 + (6 - 4)^2/4 = 7 7/48. (5) 

The important question now is, of course, “does 7 7/48 constitute an improbably 
high value for V to assume?” Before answering this question, let us consider the 
general application of the chi-square method. 

In general, suppose that every observation can fall into one of k categories. 
We take n independent observations; this means that the outcome of one observation 
has absolutely no effect on the outcome of any of the others. Let p_s be the 
probability that each observation falls into category s, and let Y_s be the number 
of observations that actually do fall into category s. We form the statistic 

V = Σ_{1≤s≤k} (Y_s - np_s)^2/(np_s). (6) 

In our example above, there are eleven possible outcomes of each throw of the 
dice, so k = 11. (Eq. (6) is a slight change of notation from Eq. (4), since we 
are numbering the possibilities from 1 to k instead of from 2 to 12.) 

By expanding (Y_s - np_s)^2 = Y_s^2 - 2np_s Y_s + n^2 p_s^2 in (6), and using the facts 
that 

Y_1 + Y_2 + ... + Y_k = n, 
p_1 + p_2 + ... + p_k = 1, (7) 

we arrive at the formula 

V = (1/n) Σ_{1≤s≤k} (Y_s^2/p_s) - n, (8) 

which often makes the computation of V somewhat easier. 
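
Both ways of computing V can be compared directly. The Python sketch below evaluates the defining sum and the expanded form V = (1/n) Σ (Y_s^2/p_s) - n, which follows from the two facts just stated, on the dice data in (2); exact fractions are used so that the two results can be tested for equality. The function names are illustrative.

```python
from fractions import Fraction


def chi_square(observed, probs):
    """V = sum over s of (Y_s - n p_s)^2 / (n p_s)."""
    n = sum(observed)
    return sum((y - n * p) ** 2 / (n * p) for y, p in zip(observed, probs))


def chi_square_short(observed, probs):
    """Equivalent expanded form V = (1/n) sum(Y_s^2 / p_s) - n."""
    n = sum(observed)
    return sum(Fraction(y * y) / p for y, p in zip(observed, probs)) / n - n


probs = [Fraction(w, 36) for w in (1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1)]
observed = [2, 4, 10, 12, 22, 29, 21, 15, 14, 9, 6]    # the data in (2)

v = chi_square(observed, probs)
print(v, float(v))
```

For this data both functions return 343/48, that is, 7 7/48. The expanded form needs only one pass of squares and divisions, at the cost of subtracting two large, nearly equal quantities when floating point is used instead of fractions.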
Table 1 

SELECTED PERCENTAGE POINTS OF THE CHI-SQUARE DISTRIBUTION 

          p = 1%    p = 5%    p = 25%   p = 50%   p = 75%   p = 95%   p = 99% 
ν = 1     0.00016   0.00393   0.1015    0.4549    1.323     3.841     6.635 
ν = 2     0.02010   0.1026    0.5753    1.386     2.773     5.991     9.210 
ν = 3     0.1148    0.3518    1.213     2.366     4.108     7.815     11.34 
ν = 4     0.2971    0.7107    1.923     3.357     5.385     9.488     13.28 
ν = 5     0.5543    1.1455    2.675     4.351     6.626     11.07     15.09 
ν = 6     0.8720    1.635     3.455     5.348     7.841     12.59     16.81 
ν = 7     1.239     2.167     4.255     6.346     9.037     14.07     18.48 
ν = 8     1.646     2.733     5.071     7.344     10.22     15.51     20.09 
ν = 9     2.088     3.325     5.899     8.343     11.39     16.92     21.67 
ν = 10    2.558     3.940     6.737     9.342     12.55     18.31     23.21 
ν = 11    3.053     4.575     7.584     10.34     13.70     19.68     24.73 
ν = 12    3.571     5.226     8.438     11.34     14.84     21.03     26.22 
ν = 15    5.229     7.261     11.04     14.34     18.25     25.00     30.58 
ν = 20    8.260     10.85     15.45     19.34     23.83     31.41     37.57 
ν = 30    14.95     18.49     24.48     29.34     34.80     43.77     50.89 
ν = 50    29.71     34.76     42.94     49.33     56.33     67.50     76.15 
ν > 30    ν + √(2ν) x_p + (2/3) x_p^2 - 2/3 + O(1/√ν) 
x_p =     -2.33     -1.64     -.675     0.00      0.675     1.64      2.33 

(For further values, see Handbook of Mathematical Functions, ed. by M. Abramowitz and I. 
A. Stegun (Washington, D.C.: U.S. Government Printing Office, 1964), Table 26.8.) 
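
The last two rows of the table can be exercised numerically. The Python sketch below assumes that the garbled entry for large ν reads ν + √(2ν) x_p + (2/3) x_p^2 - 2/3, with the O(1/√ν) term dropped, and evaluates it at ν = 50 for comparison with the tabulated row. The function name is illustrative.

```python
from math import sqrt


def chi_square_point(nu, x_p):
    """Approximate percentage point: nu + sqrt(2 nu) x_p + (2/3) x_p^2 - 2/3."""
    return nu + sqrt(2 * nu) * x_p + (2 / 3) * x_p**2 - 2 / 3


# x_p values from the bottom row of the table, keyed by column.
x_values = {"1%": -2.33, "5%": -1.64, "25%": -0.675, "50%": 0.0,
            "75%": 0.675, "95%": 1.64, "99%": 2.33}

for label, x_p in x_values.items():
    print(label, round(chi_square_point(50, x_p), 2))
```

Under that reading, the values at ν = 50 agree with the tabulated row to within about 0.1, which is consistent with an error term of order 1/√ν.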


Now we turn to the important question, what constitutes a reasonable value 
of V? This is found by referring to a table such as Table 1, which gives values of 
“the chi-square distribution with ν degrees of freedom” for various values of ν. 
The line of the table with ν = k - 1 is to be used; the number of “degrees of 
freedom” is k - 1, one less than the number of categories. (Intuitively, this means 
that Y_1, Y_2, ..., Y_k are not completely independent, since Eq. (7) shows that Y_1 
can be computed if Y_2, ..., Y_k are known; hence, k - 1 degrees of freedom are 
present. This argument is not rigorous, but the theory below justifies it.) 

If the table entry in row ν under column p is x, it means, “The quantity V 
in Eq. (8) will be less than or equal to x with approximate probability p, if n is 
large enough.” For example, the 95 percent entry in row 10 is 18.31; this says 
we will have V > 18.31 only about 5 percent of the time. 

Let us assume that the above dice-throwing experiment is simulated on a 
computer using some sequence of supposedly random numbers, with the following 
results: 

value of s =           2   3   4   5   6   7   8   9  10  11  12 
Experiment 1, Y_s =    4  10  10  13  20  18  18  11  13  14  13    (9) 
Experiment 2, Y_s =    3   7  11  15  19  24  21  17  13   9   5 

We can compute the chi-square statistic in the first case, getting the value V_1 = 
29 59/120, and in the second case we get V_2 = 1 17/120. Referring to the table entries 
for 10 degrees of freedom, we see that V_1 is much too high; V will be greater than 
23.21 only about one percent of the time! (By using more extensive tables, we 
find in fact that V will be as high as V_1 only 0.1 percent of the time.) Therefore 
Experiment 1 represents a significant departure from random behavior. 

On the other hand, V_2 is quite low, since the observed values Y_s in Experiment 2 
are quite close to the expected values np_s in (2). The chi-square table 
tells us, in fact, that V_2 is much too low: the observed values are so close to the 
expected values, we cannot consider the result to be random! (Indeed, reference 
to other tables shows that such a low value of V occurs only 0.03 percent of 
the time when there are 10 degrees of freedom.) Finally, the value V = 7 7/48 
computed in (5) can also be checked with Table 1. It falls between the entries 
for 25 percent and 50 percent, so we cannot consider it to be significantly high 
or significantly low; thus the observations in (2) are satisfactorily random with 
respect to this test. 
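
The three comparisons against the row for 10 degrees of freedom can be carried out mechanically. The Python sketch below recomputes V for both experiments and for the data in (2), then reports which tabulated percentage points each value falls between; the helper bracket and the constant names are illustrative.

```python
from bisect import bisect_left
from fractions import Fraction

# Row for 10 degrees of freedom: p = 1%, 5%, 25%, 50%, 75%, 95%, 99%.
PERCENTS = [1, 5, 25, 50, 75, 95, 99]
ROW_10 = [2.558, 3.940, 6.737, 9.342, 12.55, 18.31, 23.21]


def chi_square(observed, expected):
    """Sum of (Y_s - n p_s)^2 / (n p_s), given integer expected counts."""
    return sum(Fraction((y - e) ** 2, e) for y, e in zip(observed, expected))


def bracket(v, row=ROW_10):
    """Describe where v falls among the tabulated percentage points."""
    i = bisect_left(row, v)
    if i == 0:
        return f"below the {PERCENTS[0]}% point"
    if i == len(row):
        return f"above the {PERCENTS[-1]}% point"
    return f"between the {PERCENTS[i - 1]}% and {PERCENTS[i]}% points"


expected = [4, 8, 12, 16, 20, 24, 20, 16, 12, 8, 4]
data = {
    "Experiment 1": [4, 10, 10, 13, 20, 18, 18, 11, 13, 14, 13],
    "Experiment 2": [3, 7, 11, 15, 19, 24, 21, 17, 13, 9, 5],
    "data in (2)": [2, 4, 10, 12, 22, 29, 21, 15, 14, 9, 6],
}

for name, observed in data.items():
    v = chi_square(observed, expected)
    print(name, float(v), bracket(v))
```

A value outside the 1% to 99% range in either direction is the kind of result the text treats as suspicious; the middle of the range is unremarkable.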

It is somewhat remarkable that the same table entries are used no matter
what the value of $n$ is, and no matter what the probabilities $p_s$ are. Only the
number $\nu = k - 1$ affects the results. In actual fact, however, the table entries
are not exactly correct: the chi-square distribution is an approximation that is
valid only for large enough values of $n$. How large should $n$ be? A common rule
of thumb is to take $n$ large enough so that each of the expected values $np_s$ is
five or more; preferably, however, take $n$ much larger than this, to get a more
powerful test. Note that in our examples above we took $n = 144$, so $np_2$ was
only 4, violating the stated “rule of thumb.” This was done only because the
author tired of throwing the dice; it makes the entries in Table 1 less accurate
for our application. Experiments run on a computer, with $n = 1000$, or 10000,
or even 100000, would be much better than this. We could also combine the data
for $s = 2$ and $s = 12$; then the test would have only nine degrees of freedom
but the chi-square approximation would be more accurate.
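The arithmetic behind these statistics is easy to reproduce. The following is a minimal sketch, assuming the usual form $V = \sum_s (Y_s - np_s)^2/(np_s)$ and the fair-dice probabilities $1/36, 2/36, \ldots, 6/36, \ldots, 1/36$ for $s = 2, \ldots, 12$; the names `chi_square` and `merge_ends` are illustrative and do not come from the text. Exact fractions are used so that the results can be compared directly with the values quoted above.

```python
from fractions import Fraction

# Probabilities of the totals s = 2..12 for two fair dice.
P = [Fraction(6 - abs(s - 7), 36) for s in range(2, 13)]

def chi_square(observed, probs):
    """Return V = sum over s of (Y_s - n p_s)^2 / (n p_s), exactly."""
    n = sum(observed)
    return sum((y - n * p) ** 2 / (n * p) for y, p in zip(observed, probs))

def merge_ends(xs):
    """Combine the first and last categories (s = 2 and s = 12) into one."""
    return [xs[0] + xs[-1]] + list(xs[1:-1])

experiment_1 = [4, 10, 10, 13, 20, 18, 18, 11, 13, 14, 13]
experiment_2 = [3, 7, 11, 15, 19, 24, 21, 17, 13, 9, 5]

v1 = chi_square(experiment_1, P)   # 10 degrees of freedom
v2 = chi_square(experiment_2, P)   # 10 degrees of freedom

# Merging s = 2 with s = 12 leaves 10 categories, hence 9 degrees of freedom,
# and raises the smallest expected count from 4 to 8.
v1_merged = chi_square(merge_ends(experiment_1), merge_ends(P))

print(float(v1), float(v2), float(v1_merged))
```

With the merged categories, every expected count satisfies the rule of thumb, at the cost of one degree of freedom.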

We can get an idea of how crude an approximation is involved by considering
the case when there are only two categories, having respective probabilities
$p_1$ and $p_2$. Suppose $p_1 = \frac14$ and $p_2 = \frac34$. According to the
stated rule of thumb, we should have $n \ge 20$ to have a satisfactory
approximation, so let’s check this out. When $n = 20$, the possible values of
$V$ are $\frac{4}{15}r^2$ for $-5 \le r \le 15$; we wish to know how well the
$\nu = 1$ row of Table 1 describes the distribution of $V$.
The chi-square distribution varies continuously, while the actual distribution
of $V$ has rather big jumps, so we need some convention for representing the
exact distribution. If the distinct possible outcomes of the experiment lead to
the values $V_0 \le V_1 \le \cdots \le V_n$ with respective probabilities
$\pi_0, \pi_1, \ldots, \pi_n$, suppose that a given percentage $p$ falls in the
range $\pi_0 + \cdots + \pi_{j-1} < p < \pi_0 + \cdots + \pi_{j-1} + \pi_j$.
We would like to represent $p$ by a “percentage point” $x$
such that $V$ is less than $x$ with probability $\le p$ and $V$ is greater than
$x$ with probability $\le 1 - p$. It is not difficult to see that the only such
number is $x = V_j$. In our example for $n = 20$ and $\nu = 1$, it turns out
that the percentage points of the exact distribution, corresponding to the
approximations in Table 1
for p = 1%, 5%, 25%, 50%, 75%, 95%, and 99%, respectively, are

0, 0, .27, .27, 1.07, 4.27, 6.67.

For example, the percentage point for p = 95% is 4.27, while Table 1 gives the
estimate 3.841. The latter value is too low; it tells us (incorrectly) to reject the
value $V = 4.27$ at the 95% level, while in fact the probability that $V \ge 4.27$
is more than 6.5%. When $n = 21$, the situation changes slightly because the
expected values $np_1 = 5.25$ and $np_2 = 15.75$ can never be obtained exactly; the
percentage points for $n = 21$ are

.02, .02, .14, .40, 1.29, 3.57, 5.73.

We would expect Table 1 to be a better approximation when $n = 50$, but
the corresponding tableau actually turns out to be further from Table 1 in some
respects than it was for $n = 20$:

.03, .03, .03, .67, 1.31, 3.23, 6.0.

Here are the values when $n = 300$:

0, 0, .07, .44, 1.44, 4.0, 6.42.

Even in this case, when $np_s$ is $\ge 75$ in each category, the entries in Table 1 are
good to only about one significant digit.
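The exact percentage points quoted above for the two-category case can be recomputed directly from the binomial distribution. The sketch below assumes $p_1 = \frac14$, $p_2 = \frac34$, and the convention that the percentage point for $p$ is the value $V_j$ at which the cumulative probability first exceeds $p$; the function name `exact_percentage_points` is a hypothetical helper, not something defined in the text.

```python
from fractions import Fraction
from math import comb

LEVELS = [Fraction(k, 100) for k in (1, 5, 25, 50, 75, 95, 99)]

def exact_percentage_points(n, p1, levels=LEVELS):
    """Percentage points of the exact distribution of V for two categories."""
    p1 = Fraction(p1)
    p2 = 1 - p1
    outcomes = []
    for y1 in range(n + 1):                     # y1 observations in category 1
        y2 = n - y1
        v = (y1 - n * p1) ** 2 / (n * p1) + (y2 - n * p2) ** 2 / (n * p2)
        prob = comb(n, y1) * p1 ** y1 * p2 ** y2
        outcomes.append((v, prob))
    outcomes.sort()                             # V_0 <= V_1 <= ... <= V_n
    points = []
    for p in levels:
        cumulative = Fraction(0)
        for v, prob in outcomes:
            cumulative += prob
            if cumulative > p:                  # first j with pi_0+...+pi_j > p
                points.append(v)
                break
    return points

points_20 = [round(float(v), 2) for v in exact_percentage_points(20, Fraction(1, 4))]
points_21 = [round(float(v), 2) for v in exact_percentage_points(21, Fraction(1, 4))]
print(points_20)
print(points_21)
```

Calling the same function with larger $n$ gives a quick way to see how slowly the jumps of the exact distribution smooth out toward the continuous chi-square curve.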

The proper choice of n is somewhat obscure. If the dice are actually biased, 
the fact will be detected as n gets larger and larger. (Cf. exercise 12.) But large 
values of n will tend to smooth out locally nonrandom behavior, i.e., blocks of 
numbers with a strong bias followed by blocks of numbers with the opposite bias. 
This type of behavior would not happen when actual dice are rolled, since the 
same dice are used throughout the test, but a sequence of numbers generated on 
a computer might very well display such locally nonrandom behavior. Perhaps 
a chi-square test should be made for a number of different values of n. At any 
rate, n should always be rather large. 

We can summarize the chi-square test as follows. A fairly large number,
$n$, of independent observations is made. (It is important to avoid using the
chi-square method unless the observations are independent. See, for example,
exercise 10, which considers the case when half of the observations depend on the
other half.) We count the number of observations falling into each of $k$ categories
and compute the quantity $V$ given in Eqs. (6) and (8). Then $V$ is compared with
the numbers in Table 1, with $\nu = k - 1$. If $V$ is less than the 1% entry or
greater than the 99% entry, we reject the numbers as not sufficiently random. If
$V$ lies between the 1% and 5% entries or between the 95% and 99% entries, the
numbers are “suspect”; if (by interpolation in the table) $V$ lies between the 5%
and 10% entries, or the 90% and 95% entries, the numbers might be “almost
suspect.” The chi-square test is often done at least three times on different sets
of data, and if at least two of the three results are suspect the numbers are
regarded as not sufficiently random.

Fig. 2. Indications of “significant” deviations in 90 chi-square tests (cf. also Fig. 5).

Range of V                           Indication
0-1 percent, 99-100 percent          Reject
1-5 percent, 95-99 percent           Suspect
5-10 percent, 90-95 percent          Almost suspect
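The ranking rule summarized above is mechanical enough to write down. Here is a minimal sketch; the dictionary `NU_10` holds commonly tabulated percentage points of the chi-square distribution with 10 degrees of freedom and is an assumption of this example (the 10% and 90% entries would have to be interpolated in Table 1), and the treatment of a “reject” as at least as bad as a “suspect” in `fails_three_tests` is one interpretation of the three-test rule, not a statement from the text.

```python
# Assumed percentage points for nu = 10: x_p such that P(V <= x_p) = p percent.
NU_10 = {1: 2.558, 5: 3.940, 10: 4.865, 90: 15.99, 95: 18.31, 99: 23.21}

def classify(v, row):
    """Rank a chi-square value V against one row of percentage points."""
    if v < row[1] or v > row[99]:
        return "reject"
    if v < row[5] or v > row[95]:
        return "suspect"
    if v < row[10] or v > row[90]:
        return "almost suspect"
    return "acceptable"

def fails_three_tests(values, row):
    """Interpretation: fail when at least two results are suspect or worse."""
    bad = sum(classify(v, row) in ("reject", "suspect") for v in values)
    return bad >= 2

print(classify(29 + 59 / 120, NU_10))   # Experiment 1
print(classify(1 + 17 / 120, NU_10))    # Experiment 2
print(classify(7 + 7 / 48, NU_10))      # the value computed in (5)
```

Both tails are checked because a value of $V$ that is too small is just as improbable as one that is too large.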

For example, see Fig. 2, which shows schematically the results of applying
five different types of chi-square tests on each of six sequences of random
numbers. Each test has been applied to three different blocks of numbers of the
sequence. Generator A is the MacLaren-Marsaglia method (Algorithm 3.2.2M
applied to the sequences in 3.2.2-12), Generator E is the Fibonacci method,
and the other generators are linear congruential sequences with the following
parameters:

Generator B: $X_0 = 0$, $a = 3141592653$, $c = 2718281829$, $m = 2^{35}$.

Generator C: $X_0 = 0$, $a = 2^7 + 1$, $c = 1$, $m = 2^{35}$.

Generator D: $X_0 = 47594118$, $a = 23$, $c = 0$, $m = 10^8 + 1$.

Generator F: $X_0 = 314159265$, $a = 2^{18} + 1$, $c = 1$, $m = 2^{35}$.
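For readers who want to experiment, the four linear congruential generators listed above can be sketched as follows, assuming the usual recurrence $X_{n+1} = (aX_n + c) \bmod m$ and the fractions $U_n = X_n/m$. The names `lcg`, `GENERATORS`, and `uniforms` are illustrative only.

```python
from itertools import islice

def lcg(x0, a, c, m):
    """Yield X_1, X_2, ... where X_{n+1} = (a * X_n + c) mod m."""
    x = x0
    while True:
        x = (a * x + c) % m
        yield x

GENERATORS = {
    "B": dict(x0=0, a=3141592653, c=2718281829, m=2**35),
    "C": dict(x0=0, a=2**7 + 1, c=1, m=2**35),
    "D": dict(x0=47594118, a=23, c=0, m=10**8 + 1),
    "F": dict(x0=314159265, a=2**18 + 1, c=1, m=2**35),
}

def uniforms(name, count):
    """First `count` fractions U_n = X_n / m of the named generator."""
    params = GENERATORS[name]
    return [x / params["m"] for x in islice(lcg(**params), count)]

print(list(islice(lcg(**GENERATORS["C"]), 3)))
print(uniforms("B", 3))
```

Python integers have arbitrary precision, so the products $aX_n$ never overflow here; on a fixed-width machine the multiplication would need more care.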

From Fig. 2 we conclude that (so far as these tests are concerned) Generators A,
B, D are satisfactory, Generator C is on the borderline and should probably
be rejected, Generators E and F are definitely unsatisfactory. Generator F
has, of course, low potency; Generators C and D have been discussed in the
literature, but their multipliers are too small. (Generator D is the original
multiplicative generator proposed by Lehmer in 1948; Generator C is the original
linear congruential generator with $c \ne 0$ proposed by Rotenberg in 1960.)

Fig. 3. Examples of distribution functions.

Instead of using the “suspect,” “almost suspect,” etc., criteria for judging 
the results of chi-square tests, there is a less ad hoc procedure available, which 
will be discussed later in this section. 

B. The Kolmogorov-Smirnov test. As we have seen, the chi-square test applies 
to the situation when observations can fall into a finite number of categories. It is 
not unusual, however, to consider random quantities that may assume infinitely 
many values. For example, a random real number between 0 and 1 may take 
on infinitely many values; even though only a finite number of these can be 
represented in the computer, we want our random values to behave essentially
as though they are random real numbers.

A general notation for specifying probability distributions, whether they are
finite or infinite, is commonly used in the study of probability and statistics.
Suppose we want to specify the distribution of the values of a random quantity,
$X$; we do this in terms of the distribution function $F(x)$, where

    $F(x)$ = probability that $(X \le x)$.

Three examples are shown in Fig. 3. First we see the distribution function for
a random bit, i.e., for the case when $X$ takes on only the two values 0 and 1,
each with probability $\frac12$. Part (b) of the figure shows the distribution function
for a uniformly distributed random real number between zero and one, so the
probability that $X \le x$ is simply equal to $x$ when $0 \le x \le 1$; for example,
the probability that $X \le \frac23$ is, naturally, $\frac23$. And part (c) shows the limiting
distribution of the value $V$ in the chi-square test (shown here with 10 degrees of
freedom); this is a distribution that we have already seen represented in another
way in Table 1. Note that $F(x)$ always increases from 0 to 1 as $x$ increases from
$-\infty$ to $+\infty$.

If we make $n$ independent observations of the random quantity $X$, thereby
obtaining the values $X_1, X_2, \ldots, X_n$, we can form the empirical distribution
function $F_n(x)$, where

    $F_n(x) = \dfrac{\text{number of } X_1, X_2, \ldots, X_n \text{ that are} \le x}{n}$.        (10)

Figure 4 illustrates three empirical distribution functions (shown as zigzag lines,
although strictly speaking the vertical lines are not part of the graph of $F_n(x)$),
superimposed on a graph of the assumed actual distribution function $F(x)$. As
$n$ gets large, $F_n(x)$ should be a better and better approximation to $F(x)$.
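The definition of the empirical distribution function translates almost word for word into code. This is a small sketch, assuming the observations fit in memory; `empirical_distribution` is a hypothetical name, and binary search on the sorted sample is just one convenient way to count the observations that are $\le x$.

```python
from bisect import bisect_right

def empirical_distribution(observations):
    """Return the function F_n for the given sample, as in Eq. (10)."""
    xs = sorted(observations)
    n = len(xs)

    def F_n(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(xs, x) / n

    return F_n

F4 = empirical_distribution([0.3, 0.1, 0.7, 0.3])
print(F4(0.0), F4(0.1), F4(0.3), F4(0.5), F4(1.0))
```

The returned function is a step function that jumps by a multiple of $1/n$ at each observed value, which is why the vertical strokes in Fig. 4 are not really part of its graph.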

The Kolmogorov-Smirnov test (KS test) may be used when $F(x)$ has no
jumps. It is based on the difference between $F(x)$ and $F_n(x)$. A bad source
of random numbers will give empirical distribution functions that do not
approximate $F(x)$ sufficiently well. Figure 4(b) shows an example in which the
$X_i$ are consistently too high, so the empirical distribution function is too low.
Part (c) of the figure shows an even worse example; it is plain that such great
deviations between $F_n(x)$ and $F(x)$ are extremely improbable, and the KS test
is used to tell us how improbable they are.

To make the test, we form the following statistics:

    $K_n^+ = \sqrt{n}\,\max_{-\infty<x<+\infty}\bigl(F_n(x) - F(x)\bigr)$;

    $K_n^- = \sqrt{n}\,\max_{-\infty<x<+\infty}\bigl(F(x) - F_n(x)\bigr)$.        (11)

Here $K_n^+$ measures the greatest amount of deviation when $F_n$ is greater than $F$,
and $K_n^-$ measures the maximum deviation when $F_n$ is less than $F$. The statistics
for the examples of Fig. 4 are

               Part (a)   Part (b)   Part (c)
$K_{20}^+$      0.492      0.134      0.313
$K_{20}^-$      0.536      1.027      2.101                  (12)

(Note: The factor $\sqrt{n}$ that appears in Eqs. (11) may seem puzzling at first.
Exercise 6 shows that, for fixed $x$, the standard deviation of $F_n(x)$ is proportional
to $1/\sqrt{n}$; hence the factor $\sqrt{n}$ magnifies the statistics $K_n^+$, $K_n^-$ in such a way
that this standard deviation is independent of $n$.)

Table 2

SELECTED PERCENTAGE POINTS OF THE DISTRIBUTIONS $K_n^+$ AND $K_n^-$

          p = 1%    p = 5%    p = 25%   p = 50%   p = 75%   p = 95%   p = 99%
n = 1     0.01000   0.05000   0.2500    0.5000    0.7500    0.9500    0.9900
n = 2     0.01400   0.06749   0.2929    0.5176    0.7071    1.0980    1.2728
n = 3     0.01699   0.07919   0.3112    0.5147    0.7539    1.1017    1.3589
n = 4     0.01943   0.08789   0.3202    0.5110    0.7642    1.1304    1.3777
n = 5     0.02152   0.09471   0.3249    0.5245    0.7674    1.1392    1.4024
n = 6     0.02336   0.1002    0.3272    0.5319    0.7703    1.1463
n = 7     0.02501   0.1048    0.3280    0.5364    0.7755    1.1537    1.4246
n = 8     0.02650   0.1086    0.3280    0.5392    0.7797    1.1586    1.4327
n = 9     0.02786   0.1119    0.3274    0.5411    0.7825
n = 10    0.02912   0.1147    0.3297    0.5426    0.7845    1.1658    1.4440
n = 11    0.03028   0.1172    0.3330    0.5439    0.7863    1.1688    1.4484
n = 12    0.03137   0.1193    0.3357    0.5453    0.7880    1.1714    1.4521
n = 15    0.03424   0.1244    0.3412    0.5500    0.7926    1.1773    1.4606
n = 20    0.03807   0.1298    0.3461    0.5547    0.7975    1.1839    1.4698
n = 30    0.04354   0.1351    0.3509    0.5605    0.8036    1.1916    1.4801
n > 30    $y_p - \frac{1}{6\sqrt{n}} + O(1/n)$, where $y_p^2 = \frac12 \ln\bigl(1/(1-p)\bigr)$
$y_p$ =   0.07089   0.1601    0.3793    0.5887    0.8326    1.2239    1.5174


As in the chi-square test, we may now look up the values $K_n^+$, $K_n^-$ in a
“percentile” table to determine if they are significantly high or low. Table 2 may
be used for this purpose, both for $K_n^+$ and $K_n^-$. For example, the probability is
75 percent that $K_{20}^-$ will be 0.7975 or less. Unlike the chi-square test, the table
entries are not merely approximations that hold for large values of $n$; Table 2
gives exact values (except, of course, for roundoff error), and the KS test may
be reliably used for any value of $n$.
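For $n > 30$ the last row of Table 2 gives an approximation rather than exact entries: the percentage point is $y_p - 1/(6\sqrt{n}) + O(1/n)$ with $y_p^2 = \frac12\ln(1/(1-p))$. A minimal sketch of that row follows; the function names are illustrative, and the $O(1/n)$ term is simply dropped, so the result is only an estimate.

```python
from math import log, sqrt

def y_p(p):
    """Limiting percentage point: y_p^2 = (1/2) ln(1 / (1 - p))."""
    return sqrt(0.5 * log(1.0 / (1.0 - p)))

def ks_percentage_point(p, n):
    """Approximate p-point of K_n^+ (or K_n^-) for n > 30, ignoring O(1/n)."""
    return y_p(p) - 1.0 / (6.0 * sqrt(n))

for p in (0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99):
    print(p, round(y_p(p), 4), round(ks_percentage_point(p, 1000), 4))
```

Comparing `ks_percentage_point(p, 30)` with the exact $n = 30$ row of the table gives a feeling for the size of the neglected term.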

As they stand, formulas (11) are not readily adapted to computer calculation,
since we are asking for a maximum over infinitely many values of $x$. From
the fact that $F(x)$ is increasing and the fact that $F_n(x)$ increases only in finite
steps, however, we can derive a simple procedure for evaluating the statistics
$K_n^+$ and $K_n^-$:

Step 1. Obtain the observations $X_1, X_2, \ldots, X_n$.

Step 2. Rearrange the observations so that they are sorted into ascending order,
i.e., so that $X_1 \le X_2 \le \cdots \le X_n$. (Efficient sorting algorithms are the subject
of Chapter 5. On the other hand, it is possible to avoid sorting in this particular
application, as shown in exercise 23.)

Step 3. The desired statistics are now given by the formulas

    $K_n^+ = \sqrt{n}\,\max_{1\le j\le n}\Bigl(\dfrac{j}{n} - F(X_j)\Bigr)$;

    $K_n^- = \sqrt{n}\,\max_{1\le j\le n}\Bigl(F(X_j) - \dfrac{j-1}{n}\Bigr)$.        (13)
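Steps 1 to 3 can be sketched in a few lines. The function below assumes that `F` is a continuous distribution function supplied by the caller and that sorting the sample is acceptable; the name `ks_statistics` is illustrative.

```python
from math import sqrt

def ks_statistics(observations, F):
    """Return (K_n^+, K_n^-) for the sample, following Steps 1-3."""
    xs = sorted(observations)                 # Step 2: ascending order
    n = len(xs)
    # Step 3: enumerate() counts from 0, so index j here is j-1 of the text.
    k_plus = sqrt(n) * max((j + 1) / n - F(x) for j, x in enumerate(xs))
    k_minus = sqrt(n) * max(F(x) - j / n for j, x in enumerate(xs))
    return k_plus, k_minus

def uniform(x):
    """Distribution function of a uniform deviate on [0, 1]."""
    return x

print(ks_statistics([0.5], uniform))
print(ks_statistics([0.75, 0.25], uniform))
```

Only the $n$ sorted observations need to be examined, because $F_n(x) - F(x)$ can reach its largest value only just at a jump of $F_n$, and $F(x) - F_n(x)$ only just before one.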


An appropriate choice of the number of observations, $n$, is slightly easier to
make for this test than it is for the $\chi^2$ test, although some of the considerations
are similar. If the random variables $X_j$ actually belong to the probability
distribution $G(x)$, while they were assumed to belong to the distribution given
by $F(x)$, it will take a comparatively large value of $n$ to reject the hypothesis
that $G(x) = F(x)$; for we need $n$ large enough that the empirical distributions
$G_n(x)$ and $F_n(x)$ are expected to be observably different. On the other hand,
large values of $n$ will tend to average out locally nonrandom behavior, and such
behavior is an undesirable characteristic that is of significant importance in most
computer applications of random numbers; this makes a case for smaller values
of $n$. A good compromise would be to take $n$ equal to, say, 1000, and to make
a fairly large number of calculations of $K_{1000}^+$ on different parts of a random
sequence, thereby obtaining values

    $K_{1000}^+(1),\ K_{1000}^+(2),\ \ldots,\ K_{1000}^+(r)$.        (14)

We can also apply the KS test again to these results: Let $F(x)$ now be the
distribution function for $K_{1000}^+$, and determine the empirical distribution $F_r(x)$
obtained from the observed values in (14). Fortunately, the function $F(x)$ in this
case is very simple; for a large value of $n$ like $n = 1000$, the distribution of $K_n^+$
is closely approximated by

    $F_\infty(x) = 1 - e^{-2x^2}$,  $x \ge 0$.        (15)

The same remarks apply to $K_n^-$, since $K_n^+$ and $K_n^-$ have the same expected
behavior. This method of using several tests for moderately large $n$, then
combining the observations later in another KS test, will tend to detect both
local and global nonrandom behavior.
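The two-level procedure just described can be sketched as follows, under the assumptions that the input is a list of numbers with a known continuous distribution function `F` and that the block size is large enough for the limiting distribution $F_\infty(x) = 1 - e^{-2x^2}$ to be a reasonable stand-in. The helper names and the use of Python's built-in generator as sample input are choices made for this illustration only.

```python
import random
from math import exp, sqrt

def ks_statistics(observations, F):
    """Return (K_n^+, K_n^-) by sorting the sample."""
    xs = sorted(observations)
    n = len(xs)
    k_plus = sqrt(n) * max((j + 1) / n - F(x) for j, x in enumerate(xs))
    k_minus = sqrt(n) * max(F(x) - j / n for j, x in enumerate(xs))
    return k_plus, k_minus

def F_inf(x):
    """Limiting distribution of K_n^+ for large n."""
    return 1.0 - exp(-2.0 * x * x) if x > 0 else 0.0

def two_level_ks(values, F, block_size):
    """KS test on the K^+ values obtained from consecutive blocks."""
    starts = range(0, len(values) - block_size + 1, block_size)
    k_plus_values = [ks_statistics(values[i:i + block_size], F)[0] for i in starts]
    return ks_statistics(k_plus_values, F_inf)

rng = random.Random(12345)                      # sample input only
sample = [rng.random() for _ in range(20 * 1000)]
second_level = two_level_ks(sample, lambda x: x, 1000)
print(second_level)
```

The pair returned by `two_level_ks` would then be compared with the row of Table 2 for $n = r$, the number of blocks (20 in this sample run).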

An experiment of this type (although on a much smaller scale) was made
by the author as this chapter was being written. The “maximum of 5” test
described in the next section was applied to a set of 1000 uniform random
numbers, yielding 200 observations $X_1, X_2, \ldots, X_{200}$ that were supposed to
belong to the distribution $F(x) = x^5$ ($0 \le x \le 1$). The observations were
divided into 20 groups of 10 each, and the statistic $K_{10}^+$ was computed for each
group. The 20 values of $K_{10}^+$, thus obtained, led to the empirical distributions
shown in Fig. 4. The smooth curve shown in each of the diagrams in Fig. 4
is the actual distribution the statistic $K_{10}^+$ should have. Figure 4(a) shows the
empirical distribution of $K_{10}^+$ obtained from the sequence

    $Y_{n+1} = (3141592653\,Y_n + 2718281829) \bmod 2^{35}$,   $U_n = Y_n/2^{35}$,

and it is satisfactorily random. Part (b) of the figure came from the Fibonacci
method; this sequence has globally nonrandom behavior, i.e., it can be shown
that the observations $X_n$ in the “maximum of 5” test do not have the correct
distribution $F(x) = x^5$. Part (c) came from the notorious and impotent linear
congruential sequence $Y_{n+1} = ((2^{18} + 1)Y_n + 1) \bmod 2^{35}$, $U_n = Y_n/2^{35}$.

The KS test applied to the data in Fig. 4 gives the results shown in (12).
Referring to Table 2 for $n = 20$, we see that the values of $K_{20}^+$ and $K_{20}^-$ for
Fig. 4(b) are almost suspect (they lie at about the 5 percent and 88 percent
levels) but not quite bad enough to be rejected outright. The value of $K_{20}^-$ for
Part (c) is, of course, completely out of line, so the “maximum of 5” test shows
a definite failure of that random number generator.

We would expect the KS test in this experiment to have more difficulty
locating global nonrandomness than local nonrandomness, since the basic
observations in Fig. 4 were made on samples of only 10 numbers each. If we were
to take 20 groups of 1000 numbers each, part (b) would show a much more
significant deviation. To illustrate this point, a single KS test was applied to
all 200 of the observations that led to Fig. 4, and the following results were
obtained:

               Part (a)   Part (b)   Part (c)
$K_{200}^+$     0.477      1.537      2.819
$K_{200}^-$     0.817      0.194      0.058                  (16)

The global nonrandomness of the Fibonacci generator has definitely been
detected here.

We may summarize the Kolmogorov-Smirnov test as follows. We are given
$n$ independent observations $X_1, \ldots, X_n$ taken from some distribution specified
by a continuous function $F(x)$. That is, $F(x)$ must be like the functions shown
in Fig. 3(b) and 3(c), having no jumps like those in Fig. 3(a). The procedure
explained just before Eqs. (13) is carried out on these observations, so we obtain
the statistics $K_n^+$ and $K_n^-$. These statistics should be distributed according to
Table 2.

Some comparisons between the KS test and the $\chi^2$ test can now be made.
In the first place, we should observe that the KS test may be used in conjunction
with the $\chi^2$ test, to give a better procedure than the ad hoc method we mentioned
when summarizing the $\chi^2$ test. (That is, there is a better way to proceed than
to make three tests and to consider how many of the results were “suspect”.)
Suppose we have made, say, 10 independent $\chi^2$ tests on different parts of a
random sequence, so that values $V_1, V_2, \ldots, V_{10}$ have been obtained. It is not a
good policy simply to count how many of the $V$'s are suspiciously large or small.
This procedure will work in extreme cases, and very large or very small values
may mean that the sequence has too much local nonrandomness; but a better
general method would be to plot the empirical distribution of these 10 values and
to compare it to the correct distribution, which may be obtained from Table 1.
This would give a clearer picture of the results of the $\chi^2$ tests, and in fact the
statistics $K_{10}^+$ and $K_{10}^-$ could be determined as an indication of the success or
failure. With only 10 values or even as many as 100 this could all be done easily
by hand, using graphical methods; with a larger number of $V$’s, a computer
subroutine for calculating the chi-square distribution would be necessary. Notice
that all 20 of the observations in Fig. 4(c) fall between the 5 and 95 percent
levels, so we would not have regarded any of them as suspicious, individually;
yet collectively the empirical distribution shows that these observations are not
at all right.
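A subroutine of the kind mentioned above might look like the following sketch. It assumes the standard expression of the chi-square distribution through the incomplete gamma function and evaluates it with a plain power series, which is adequate for moderate values of $V$ but is not meant as a production-quality routine; `chi2_cdf` and `ks_of_chi_squares` are hypothetical names.

```python
from math import exp, lgamma, log, sqrt

def chi2_cdf(v, nu):
    """P(V <= v) for nu degrees of freedom, by the series for gamma(a, x)/Gamma(a)."""
    if v <= 0:
        return 0.0
    a, x = nu / 2.0, v / 2.0
    term = exp(a * log(x) - x - lgamma(a + 1.0))   # x^a e^{-x} / Gamma(a + 1)
    total, k = term, 1
    while term > 1e-17 * total:
        term *= x / (a + k)                         # next term of the series
        total += term
        k += 1
    return min(total, 1.0)

def ks_of_chi_squares(vs, nu):
    """K^+ and K^- of a set of chi-square values against the chi-square law."""
    us = sorted(chi2_cdf(v, nu) for v in vs)
    n = len(us)
    k_plus = sqrt(n) * max((j + 1) / n - u for j, u in enumerate(us))
    k_minus = sqrt(n) * max(u - j / n for j, u in enumerate(us))
    return k_plus, k_minus

print(chi2_cdf(23.21, 10), chi2_cdf(137 / 120, 10), 1 - chi2_cdf(3539 / 120, 10))
```

The last two printed numbers correspond to the tail probabilities of roughly 0.03 percent and 0.1 percent mentioned for Experiments 2 and 1 of (9).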

An important difference between the KS test and the chi-square test is that
the KS test applies to distributions $F(x)$ having no jumps, while the chi-square
test applies to distributions having nothing but jumps (since all observations are
divided into $k$ categories). The two tests are thus intended for different sorts of
applications. Yet it is possible to apply the $\chi^2$ test even when $F(x)$ is continuous,
if we divide the domain of $F(x)$ into $k$ parts and ignore all variations within each
part. For example, if we want to test whether or not $U_1, U_2, \ldots, U_n$ can be
considered to come from the uniform distribution between zero and one, we want
to test if they have the distribution $F(x) = x$ for $0 \le x \le 1$. This is a natural
application for the KS test. But we might also divide up the interval from 0 to 1
into $k = 100$ equal parts, count how many $U$'s fall into each part, and apply
the chi-square test with 99 degrees of freedom. There are not many theoretical
results available at the present time to compare the effectiveness of the KS test
versus the chi-square test. The author has found some examples in which the
KS test pointed out nonrandomness more clearly than the $\chi^2$ test, and others
in which the $\chi^2$ test gave a more significant result. If, for example, the 100
categories mentioned above are numbered 0, 1, ..., 99, and if the deviations
from the expected values are positive in compartments 0 to 49 but negative in
compartments 50 to 99, then the empirical distribution function will be much
further from $F(x)$ than the $\chi^2$ value would indicate; but if the positive deviations
occur in compartments 0, 2, ..., 98 and the negative ones occur in 1, 3, ..., 99,
the empirical distribution function will tend to hug $F(x)$ much more closely.
The kinds of deviations measured are therefore somewhat different. A $\chi^2$ test
was applied to the 200 observations that led to Fig. 4, with $k = 10$, and the
respective values of $V$ were 9.4, 17.7, and 39.3; so in this particular case the
values are quite comparable to the KS values given in (16). Since the $\chi^2$ test is
intrinsically less accurate, and since it requires comparatively large values of $n$,
the KS test has several advantages when a continuous distribution is to be tested.
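Dividing the unit interval into $k$ equal parts, as described above, reduces a continuous problem to the categorical setting of the chi-square test. A minimal sketch, assuming the inputs lie in $[0, 1]$ and that every part has the same expected count $n/k$; the function name is illustrative.

```python
def chi_square_uniform(us, k=100):
    """Chi-square statistic V (k - 1 degrees of freedom) for k equal parts of [0, 1]."""
    n = len(us)
    counts = [0] * k
    for u in us:
        # min() keeps a value of exactly 1.0 in the last part.
        counts[min(int(u * k), k - 1)] += 1
    expected = n / k
    return sum((c - expected) ** 2 / expected for c in counts)

evenly_spread = [(i + 0.5) / 1000 for i in range(1000)]
print(chi_square_uniform(evenly_spread, 100))   # every part holds exactly 10 values
print(chi_square_uniform([0.0] * 100, 10))      # everything in the first part
```

Note that the first sample, although perfectly balanced across the parts, would be judged far too regular by the rule that rejects values of $V$ below the 1 percent point.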

A further example will also be of interest. The data that led to Fig. 2 were
chi-square statistics based on $n = 200$ observations of the “maximum-of-$t$”
criterion for $1 \le t \le 5$, with the range divided into 10 equally probable parts.
KS statistics $K_{200}^+$ and $K_{200}^-$ can be computed from exactly the same sets of 200
observations, and the results can be tabulated in just the same way as we did
in Fig. 2 (showing which KS values are beyond the 99-percent level, etc.); the
results in this case are shown in Fig. 5. Note that Generator D (Lehmer’s original
method) shows up very badly in Fig. 5, while chi-square tests on the very same
data revealed no difficulty in Fig. 2; contrariwise, Generator E (the Fibonacci
method) does not look so bad in Fig. 5. The good generators, A and B, passed all
tests satisfactorily. The reasons for the discrepancies between Fig. 2 and Fig. 5
are primarily that (a) the number of observations, 200, is really not large enough
for a powerful test, and (b) the “reject,” “suspect,” “almost suspect” ranking
criterion is itself suspect.

Fig. 5. The KS tests applied to the same data as Fig. 2.

(Incidentally, it is not fair to blame Lehmer for using a “bad” random
number generator in the 1940s, since his actual use of Generator D was quite
valid. The ENIAC computer was a highly parallel machine, programmed by
means of a plugboard; Lehmer set it up so that one of its accumulators was
repeatedly multiplying its own contents by 23, mod $(10^8 + 1)$, yielding a new value
every few microseconds. Since this multiplier 23 is too small, we know that each
value obtained by such a process was too strongly related to the preceding value
to be considered sufficiently random; but the durations of time between actual
uses of the values in the special accumulator by the accompanying program were
comparatively long and subject to some fluctuation. So the effective multiplier
was $23^k$ for large, varying values of $k$.)
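The remark about the effective multiplier is easy to check numerically: skipping $k$ steps of a purely multiplicative generator is the same as a single multiplication by $a^k \bmod m$. The sketch below uses the parameters of Generator D; the function names and the particular value of `k` are illustrative.

```python
M = 10**8 + 1     # modulus of Generator D
A = 23            # its multiplier

def advance(x, k):
    """Apply X <- 23 X mod (10^8 + 1) exactly k times."""
    for _ in range(k):
        x = (A * x) % M
    return x

def effective_multiplier(k):
    """Multiplier seen by a program that samples every k-th value."""
    return pow(A, k, M)

x0 = 47594118
k = 1000
print(advance(x0, k), (effective_multiplier(k) * x0) % M)
```

Both printed numbers agree, which is why a program that consulted the accumulator only occasionally saw a generator with a much larger multiplier than 23.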

C. History, bibliography, and theory. The chi-square test was introduced by
Karl Pearson in 1900 [Philosophical Magazine, Series 5, 50, 157-175]. Pearson’s
important paper is regarded as one of the foundations of modern statistics, since
before that time people would simply plot experimental results graphically and
assert that they were correct. In his paper, Pearson gave several interesting
examples of the previous misuse of statistics; and he also proved that certain
runs at roulette (which he had experienced during two weeks at Monte Carlo in
1892) were so far from the expected frequencies that odds against the assumption
of an honest wheel were some $10^{29}$ to one! A general discussion of the chi-square
test and an extensive bibliography appear in the survey article by William G.
Cochran, Annals Math. Stat. 23 (1952), 315-345.

Let us now consider a brief derivation of the theory behind the chi-square
test. The exact probability that $Y_1 = y_1, \ldots, Y_k = y_k$ is easily seen to be

    $\dfrac{n!}{y_1! \cdots y_k!}\, p_1^{y_1} \cdots p_k^{y_k}$.        (17)

If we assume that $Y_s$ has the value $y_s$ with the Poisson probability

    $\dfrac{e^{-np_s}(np_s)^{y_s}}{y_s!}$,

and that the $Y$’s are independent, then $(Y_1, \ldots, Y_k)$ will equal $(y_1, \ldots, y_k)$ with
probability

    $\prod_{1\le s\le k} \dfrac{e^{-np_s}(np_s)^{y_s}}{y_s!}$,

and $Y_1 + \cdots + Y_k$ will equal $n$ with probability

    $\sum_{\substack{y_1+\cdots+y_k=n \\ y_1,\ldots,y_k\ge 0}} \prod_{1\le s\le k} \dfrac{e^{-np_s}(np_s)^{y_s}}{y_s!} = \dfrac{e^{-n}n^n}{n!}$.

If we assume that they are independent except for the condition
$Y_1 + \cdots + Y_k = n$, the probability that $(Y_1, \ldots, Y_k) = (y_1, \ldots, y_k)$ is the quotient

    $\Bigl(\prod_{1\le s\le k} \dfrac{e^{-np_s}(np_s)^{y_s}}{y_s!}\Bigr) \Big/ \Bigl(\dfrac{e^{-n}n^n}{n!}\Bigr)$,

which equals (17). We may therefore regard the $Y$’s as independently Poisson
distributed, except for the fact that they have a fixed sum.
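The claim that the conditioned Poisson probabilities reproduce the multinomial probability (17) can be checked numerically for any small case. The sketch below does this in floating point; the function names and the particular values of `ys` and `ps` are arbitrary choices for illustration.

```python
from math import exp, factorial, isclose, prod

def multinomial_probability(ys, ps):
    """Probability (17): n!/(y_1! ... y_k!) * p_1^y_1 ... p_k^y_k."""
    n = sum(ys)
    coefficient = factorial(n) // prod(factorial(y) for y in ys)
    return coefficient * prod(p ** y for p, y in zip(ps, ys))

def conditioned_poisson_probability(ys, ps):
    """Product of Poisson probabilities divided by P(Y_1 + ... + Y_k = n)."""
    n = sum(ys)
    joint = prod(exp(-n * p) * (n * p) ** y / factorial(y) for p, y in zip(ps, ys))
    total_is_n = exp(-n) * n ** n / factorial(n)
    return joint / total_is_n

ys = [2, 3, 1]
ps = [0.2, 0.5, 0.3]
print(multinomial_probability(ys, ps), conditioned_poisson_probability(ys, ps))
```

The agreement is exact in principle, since the factors $e^{-np_s}$ multiply to $e^{-n}$ and the powers of $n$ cancel; only rounding error separates the two printed values.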

It is convenient to make a change of variables,

    $Z_s = \dfrac{Y_s - np_s}{\sqrt{np_s}}$,        (18)

so that $V = Z_1^2 + \cdots + Z_k^2$. The condition $Y_1 + \cdots + Y_k = n$ is equivalent to
requiring that

    $\sqrt{p_1}\,Z_1 + \cdots + \sqrt{p_k}\,Z_k = 0$.        (19)

Let us consider the $(k-1)$-dimensional space $S$ of all vectors $(Z_1, \ldots, Z_k)$
such that (19) holds. For large values of $n$, each $Z_s$ has approximately the
normal distribution (cf. exercise 1.2.10-16); therefore points in a differential
volume $dz_2 \ldots dz_k$ of $S$ occur with probability approximately proportional to
$\exp\bigl(-(z_1^2 + \cdots + z_k^2)/2\bigr)$. (It is at this point in the derivation that the
chi-square method becomes only an approximation for large $n$.) The probability
that $V \le v$ is now

    $\dfrac{\int_{(z_1,\ldots,z_k)\text{ in }S\text{ and }z_1^2+\cdots+z_k^2\le v} \exp\bigl(-(z_1^2+\cdots+z_k^2)/2\bigr)\,dz_2\ldots dz_k}{\int_{(z_1,\ldots,z_k)\text{ in }S} \exp\bigl(-(z_1^2+\cdots+z_k^2)/2\bigr)\,dz_2\ldots dz_k}$.        (20)

Since the hyperplane (19) passes through the origin of $k$-dimensional space, the
numerator in (20) is an integration over the interior of a $(k-1)$-dimensional
hypersphere centered at the origin. An appropriate transformation to generalized
polar coordinates with radius $\chi$ and angles $\omega_1, \ldots, \omega_{k-2}$ transforms (20) into

    $\dfrac{\int_{\chi^2\le v} e^{-\chi^2/2}\chi^{k-2} f(\omega_1,\ldots,\omega_{k-2})\,d\chi\,d\omega_1\ldots d\omega_{k-2}}{\int e^{-\chi^2/2}\chi^{k-2} f(\omega_1,\ldots,\omega_{k-2})\,d\chi\,d\omega_1\ldots d\omega_{k-2}}$

for some function $f$ (see exercise 15); then integration over the angles $\omega_1, \ldots,
\omega_{k-2}$ gives a constant factor that cancels from numerator and denominator. We
finally obtain the formula

    $\dfrac{\int_0^{\sqrt{v}} e^{-\chi^2/2}\chi^{k-2}\,d\chi}{\int_0^{\infty} e^{-\chi^2/2}\chi^{k-2}\,d\chi}$        (21)

for the approximate probability that $V \le v$.

The above derivation uses the symbol $\chi$ to stand for the radial length,
just as Pearson did in his original paper; this is how the $\chi^2$ test got its name.
Substituting $t = \chi^2/2$, the integrals can be expressed in terms of the incomplete
gamma function, which we discussed in Section 1.2.11.3:

    $\lim_{n\to\infty}$ probability that $(V \le v)$ $= \gamma\Bigl(\dfrac{k-1}{2}, \dfrac{v}{2}\Bigr) \Big/ \Gamma\Bigl(\dfrac{k-1}{2}\Bigr)$.        (22)

This is the definition of the chi-square distribution with $k - 1$ degrees of freedom.
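For completeness, the substitution $t = \chi^2/2$ mentioned above can be written out. The following worked step is a sketch that assumes the usual definitions $\gamma(a, x) = \int_0^x e^{-t}t^{a-1}\,dt$ and $\Gamma(a) = \gamma(a, \infty)$; with $\chi = \sqrt{2t}$ and $d\chi = dt/\sqrt{2t}$ we get

```latex
\begin{aligned}
\int_0^{\sqrt{v}} e^{-\chi^2/2}\,\chi^{k-2}\,d\chi
  &= \int_0^{v/2} e^{-t}\,(2t)^{(k-2)/2}\,\frac{dt}{\sqrt{2t}}
   = 2^{(k-3)/2}\int_0^{v/2} e^{-t}\,t^{(k-1)/2-1}\,dt
   = 2^{(k-3)/2}\,\gamma\!\left(\frac{k-1}{2},\frac{v}{2}\right),\\[4pt]
\int_0^{\infty} e^{-\chi^2/2}\,\chi^{k-2}\,d\chi
  &= 2^{(k-3)/2}\,\Gamma\!\left(\frac{k-1}{2}\right).
\end{aligned}
```

The common factor $2^{(k-3)/2}$ cancels in the quotient, leaving the ratio of the incomplete gamma function to the complete one.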

We now turn to the KS test. In 1933, A. N. Kolmogorov proposed a test
based on the statistic

    $K_n = \sqrt{n}\,\max_{-\infty<x<+\infty} |F_n(x) - F(x)| = \max(K_n^+, K_n^-)$.        (23)

N. V. Smirnov gave several modifications of this test in 1939, including the
individual examination of $K_n^+$ and $K_n^-$ as we have suggested above. There is a
large family of similar tests, but the $K_n^+$ and $K_n^-$ statistics seem to be most
convenient for computer application. A comprehensive review of the literature
concerning KS tests and their generalizations, including an extensive
bibliography, appears in a monograph by J. Durbin, Regional Conf. Series on Applied
Math. 9 (SIAM, 1973).









To study the distribution of $K_n^+$ and $K_n^-$, we begin with the following basic
fact: If $X$ is a random variable with the continuous distribution $F(x)$, then $F(X)$
is a uniformly distributed real number between 0 and 1. To prove this, we need
only verify that if $0 \le y \le 1$ we have $F(X) \le y$ with probability $y$. Since $F$ is
continuous, $F(x_0) = y$ for some $x_0$; thus the probability that $F(X) \le y$ is the
probability that $X \le x_0$. By definition, the latter probability is $F(x_0)$, that is,
it is $y$.
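This basic fact is easy to observe empirically. The sketch below takes $X$ to be the maximum of five uniform deviates, so that $F(x) = x^5$, and checks that the transformed values $F(X)$ fall below $y$ about a fraction $y$ of the time. The sample size, the seed, and the use of Python's built-in generator are arbitrary choices for illustration.

```python
import random

def F(x):
    """Distribution function of the maximum of five uniform deviates."""
    return x ** 5

rng = random.Random(2024)
xs = [max(rng.random() for _ in range(5)) for _ in range(20000)]
ys = [F(x) for x in xs]                       # should look uniform on [0, 1]

fraction_below = {y: sum(v <= y for v in ys) / len(ys) for y in (0.1, 0.5, 0.9)}
print(fraction_below)
```

Each observed fraction should differ from its target $y$ by an amount on the order of $1/\sqrt{20000}$, in line with the standard deviation discussed in exercise 6.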

Let $Y_j = nF(X_j)$, for $1 \le j \le n$, where the $X$’s have been sorted as in
Step 2 above. Then the variables $Y_j$ are essentially the same as independent,
uniformly distributed random numbers between 0 and $n$ that have been sorted
into nondecreasing order, $Y_1 \le Y_2 \le \cdots \le Y_n$; and the first equation of (13)
may be transformed into

    $K_n^+ = \dfrac{1}{\sqrt{n}}\max(1 - Y_1,\ 2 - Y_2,\ \ldots,\ n - Y_n)$.

If $0 \le t \le n$, the probability that $K_n^+ \le t/\sqrt{n}$ is therefore the probability that
$Y_j \ge j - t$ for $1 \le j \le n$. This is not hard to express in terms of $n$-dimensional
integrals,

    $\dfrac{\int_{\alpha_n}^{n} dy_n \int_{\alpha_{n-1}}^{y_n} dy_{n-1} \cdots \int_{\alpha_1}^{y_2} dy_1}{\int_0^{n} dy_n \int_0^{y_n} dy_{n-1} \cdots \int_0^{y_2} dy_1}$,   where $\alpha_j = \max(j - t, 0)$.        (24)

The denominator here is immediately evaluated: it is found to be $n^n/n!$, which
makes sense since the hypercube of all vectors $(y_1, y_2, \ldots, y_n)$ with $0 \le y_j \le n$
has volume $n^n$, and it can be divided into $n!$ equal parts corresponding to each
possible ordering of the $y$’s. The integral in the numerator is a little more
difficult, but it yields to the attack suggested in exercise 17, and we get the
general formula

    probability that $K_n^+ \le \dfrac{t}{\sqrt{n}}$ $= \dfrac{t}{n^n} \sum_{0\le k\le t} \dbinom{n}{k} (k - t)^k (t + n - k)^{n-k-1}$.        (25)

The distribution of $K_n^-$ is exactly the same. Equation (25) was first obtained by
Z. W. Birnbaum and Fred H. Tingey [Annals Math. Stat. 22 (1951), 592-596]; it
may be used to extend Table 2.
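Formula (25) is short enough to evaluate directly, which gives a way to spot-check entries of Table 2. The sketch below assumes the form of (25) with the sum taken over $0 \le k \le t$ and uses ordinary floating point arithmetic, which should be adequate for the small values of $n$ in the table; `ks_plus_cdf` is a hypothetical name.

```python
from math import comb, floor, sqrt

def ks_plus_cdf(n, s):
    """Probability that K_n^+ <= s, via formula (25) with t = s * sqrt(n)."""
    t = s * sqrt(n)
    if t <= 0:
        return 0.0
    if t >= n:
        return 1.0
    total = 0.0
    for k in range(floor(t) + 1):             # all integers k with 0 <= k <= t
        total += comb(n, k) * (k - t) ** k * (t + n - k) ** (n - k - 1)
    return t * total / n ** n

print(ks_plus_cdf(2, 0.5176), ks_plus_cdf(3, 0.7539), ks_plus_cdf(20, 0.7975))
```

For $n = 2$ the formula reduces to $t(t+2)/4$ when $t \le 1$, so the 50 percent point is $t = \sqrt{3} - 1$, that is, $s = (\sqrt{3}-1)/\sqrt{2} \approx 0.5176$, in agreement with the table.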

In his original paper, Smirnov proved that

    $\lim_{n\to\infty}$ probability that $(K_n^+ \le s)$ $= 1 - e^{-2s^2}$,   if $s \ge 0$.        (26)

This together with (25) implies that, for all $s \ge 0$, we have

    $\lim_{n\to\infty} \dfrac{s}{\sqrt{n}} \sum_{0\le k\le s\sqrt{n}} \dbinom{n}{k} \Bigl(\dfrac{k}{n} - \dfrac{s}{\sqrt{n}}\Bigr)^{k} \Bigl(\dfrac{s}{\sqrt{n}} + 1 - \dfrac{k}{n}\Bigr)^{n-k-1} = 1 - e^{-2s^2}$.        (27)

The more precise asymptotic formulas in Table 2 follow from results obtained by
D. A. Darling [Theory of Prob. and Appl. 5 (1960), 356-361], who proved among
other things that $K_n^+ \le s$ with probability

    $1 - e^{-2s^2}\bigl(1 - \tfrac{2}{3}s/\sqrt{n} + O(1/n)\bigr)$        (28)

for all fixed $s \ge 0$.


EXERCISES 

1. [00] What line of the chi-square table should be used to check whether or not the
value $V = 7\frac{7}{48}$ of Eq. (5) is improbably high?

2. [20] If two dice are “loaded” so that, on one die, the value 1 will turn up exactly
twice as often as any of the other values, and the other die is similarly biased towards
6, compute the probability $p_s$ that a total of exactly $s$ will appear on the two dice, for
$2 \le s \le 12$.

► 3. [23] Some dice that were loaded as described in the previous exercise were rolled
144 times, and the following values were observed:

value of s =               2   3   4   5   6   7   8   9  10  11  12

observed number, $Y_s$ =   2   6  10  16  18  32  20  13  16   9   2

Apply the chi-square test to these values, using the probabilities in (1), pretending it
is not known that the dice are in fact faulty. Does the chi-square test detect the bad
dice? If not, explain why not.

► 4. [28] The author actually obtained the data in experiment 1 of (9) by simulating
dice in which one was normal, the other was loaded so that it always turned up 1 or 6.
(The latter two possibilities were equally probable.) Compute the probabilities that
replace (1) in this case, and by using a chi-square test decide if the results of that
experiment are consistent with the dice being loaded in this way.

5. [22] Let $F(x)$ be the uniform distribution, Fig. 3(b). Find $K_{20}^+$ and $K_{20}^-$ for the
following 20 observations:

0.414, 0.732, 0.236, 0.162, 0.259, 0.442, 0.189, 0.693, 0.098, 0.302,

0.442, 0.434, 0.141, 0.017, 0.318, 0.869, 0.772, 0.678, 0.354, 0.718,

and state whether these observations are significantly different from expected behavior
with respect to either of these two tests.

6. [M20] Consider $F_n(x)$, as given in Eq. (10), for fixed $x$. What is the probability
that $F_n(x) = s/n$, given an integer $s$? What is the mean value of $F_n(x)$? What is the
standard deviation?

7. [M15] Show that $K_n^+$ and $K_n^-$ can never be negative. What is the largest possible
value $K_n^+$ can be?

8. [00] The text describes an experiment in which 20 values of the statistic $K_{10}^+$
were obtained in the study of a random sequence. These values were plotted, to obtain
Fig. 4, and a KS statistic was computed from the resulting graph. Why were the table
entries for $n = 20$ used to study the resulting statistic, instead of the table entries for
$n = 10$?

► 9. [20] The experiment described in the text consisted of plotting 20 values of
$K_{10}^+$, computed from the “maximum of 5” test applied to different parts of a random
sequence. We could have computed also the corresponding 20 values of $K_{10}^-$; since $K_{10}^-$
has the same distribution as $K_{10}^+$, we could lump together the 40 values thus obtained
(that is, 20 of the $K_{10}^+$’s and 20 of the $K_{10}^-$’s), and a KS test could be applied so that
we would get new values $K_{40}^+$, $K_{40}^-$. Discuss the merits of this idea.

► 10. [20] Suppose a chi-square test is done by making $n$ observations, and the value $V$
is obtained. Now we repeat the test on these same $n$ observations over again (getting,
of course, the same results), and we put together the data from both tests, regarding
it as a single chi-square test with $2n$ observations. (This procedure violates the text’s
stipulation that all of the observations must be independent of one another.) How is
the second value of $V$ related to the first one?

11. [10] Solve exercise 10 substituting the KS test for the chi-square test.

12. [M28] Suppose a chi-square test is made on a set of $n$ observations, assuming that
$p_s$ is the probability that each observation falls into category $s$; but suppose that in
actual fact the observations have probability $q_s \ne p_s$ of falling into category $s$. (Cf.
exercise 3.) We would, of course, like the chi-square test to detect the fact that the $p_s$
assumption was incorrect. Show that this will happen, if $n$ is large enough. Prove also
the analogous result for the KS test.

13. [M24] Prove that Eqs. (13) are equivalent to Eqs. (11). 

► 14. [HM26] Let $Z_s$ be given by Eq. (18). Show directly by using Stirling’s approximation
that the multinomial probability

$$n!\,p_1^{Y_1} \dots p_k^{Y_k} / Y_1! \dots Y_k! = e^{-V/2} \Big/ \sqrt{(2n\pi)^{k-1} p_1 \dots p_k} + O(n^{-k/2}),$$

if $Z_1, Z_2, \dots, Z_k$ are bounded as $n \to \infty$. (This idea leads to a proof of the chi-square
test that is much closer to “first principles,” and requires less handwaving, than the
derivation in the text.)

15. [HM24] Polar coordinates in two dimensions are conventionally defined by the
equations $x = r\cos\theta$ and $y = r\sin\theta$. For the purposes of integration, we have
$dx\,dy = r\,dr\,d\theta$. More generally, in $n$-dimensional space we can let

$$x_k = r\sin\theta_1 \dots \sin\theta_{k-1}\cos\theta_k, \quad 1 \le k < n, \qquad x_n = r\sin\theta_1 \dots \sin\theta_{n-1}.$$

Show that in this case

$$dx_1\,dx_2 \dots dx_n = r^{n-1}\sin^{n-2}\theta_1 \dots \sin\theta_{n-2}\,dr\,d\theta_1 \dots d\theta_{n-1}.$$

► 16. [HM35] Generalize Theorem 1.2.11.3A to find the value of

$$\gamma(x+1,\; x + z\sqrt{2x} + y)/\Gamma(x+1),$$

for large $x$ and fixed $y$, $z$. Disregard terms of the answer that are $O(1/x)$. Use this
result to find the approximate solution, $t$, to the equation

$$\gamma(\nu/2,\; t/2)/\Gamma(\nu/2) = p,$$

for large $\nu$ and fixed $p$, thereby accounting for the asymptotic formulas indicated in
Table 1. [Hint: See exercise 1.2.11.3-8.]

17. [HM26] Let $t$ be a fixed real number. For $0 \le k \le n$, let

$$P_{nk}(x) = \int_{n-t}^{x} dx_n \int_{n-1-t}^{x_n} dx_{n-1} \dots \int_{k+1-t}^{x_{k+2}} dx_{k+1} \int_{0}^{x_{k+1}} dx_k \dots \int_{0}^{x_2} dx_1;$$

by convention, let $P_{00}(x) = 1$. Prove the following relations:

a) $$P_{nk}(x) = \int_{n}^{x+t} dx_n \int_{n-1}^{x_n} dx_{n-1} \dots \int_{k+1}^{x_{k+2}} dx_{k+1} \int_{t}^{x_{k+1}} dx_k \dots \int_{t}^{x_2} dx_1.$$

b) $P_{n0}(x) = (x+t)^n/n! - (x+t)^{n-1}/(n-1)!$.

c) $P_{nk}(x) - P_{n(k-1)}(x) = \dfrac{(k-t)^k}{k!}\,P_{(n-k)0}(x-k)$, if $1 \le k \le n$.

d) Obtain a general formula for $P_{nk}(x)$, and apply it to the evaluation of Eq. (24).

18. [M20] Give a “simple” reason why $K_n^-$ has the same probability distribution as
$K_n^+$.

19. [HM48] Develop tests, analogous to the Kolmogorov-Smirnov test, for use with
multivariate distributions $F(x_1, \dots, x_r)$ = probability that $(X_1 \le x_1, \dots, X_r \le x_r)$.
(Such procedures could be used, for example, in place of the “serial test” in the next
section.)

20. [HM41] Deduce further terms of the asymptotic behavior of the KS distribution,
extending (28).

21. [M40] Although the text states that the KS test should be applied only when
$F(x)$ is a continuous distribution function, it is, of course, possible to try to compute
$K_n^+$ and $K_n^-$ even when the distribution has jumps. Analyze the probable behavior of
$K_n^+$ and $K_n^-$ for various discontinuous distributions $F(x)$. Compare the effectiveness
of the resulting statistical test with the chi-square test on several samples of random
numbers.

22. [HM46] Investigate the “improved” KS test suggested in the answer to exercise 6.

23. [M22] (T. Gonzalez, S. Sahni, and W. R. Franta.) (a) Suppose that the maximum
value in formula (13) for the KS statistic $K_n^+$ occurs at a given index $j$ where
$\lfloor nF(X_j) \rfloor = k$. Prove that $F(X_j) = \max_{1 \le i \le n}\{F(X_i) \mid \lfloor nF(X_i) \rfloor = k\}$. (b) Design
an algorithm that calculates $K_n^+$ and $K_n^-$ in $O(n)$ steps (without sorting).

► 24. [40] Experiment with various probability distributions $(p, q, r)$ on three categories,
where $p + q + r = 1$, by computing the exact distribution of the chi-square statistic $V$
for various $n$, thereby determining how accurate an approximation the chi-square
distribution with two degrees of freedom really is.


3.3.2. Empirical Tests 

In this section we shall discuss ten kinds of specific tests that have been
applied to sequences in order to investigate their randomness. The discussion of
each test has two parts: (a) a “plug-in” description of how to perform the test;
and (b) a study of the theoretical basis for the test. (Readers lacking mathematical
training may wish to skip over the theoretical discussions. Conversely,
mathematically inclined readers may find the associated theory quite interesting,
even if they never intend to test random number generators, since some
instructive combinatorial questions are involved here.)

Each test is applied to a sequence

$$\langle U_n \rangle = U_0, U_1, U_2, \dots \tag{1}$$

of real numbers, which purports to be independently and uniformly distributed
between zero and one. Some of the tests are designed primarily for integer-valued
sequences, instead of the real-valued sequence (1). In this case, the auxiliary
sequence

$$\langle Y_n \rangle = Y_0, Y_1, Y_2, \dots, \tag{2}$$

which is defined by the rule

$$Y_n = \lfloor dU_n \rfloor, \tag{3}$$

is used instead. This is a sequence of integers that purports to be independently
and uniformly distributed between 0 and $d - 1$. The number $d$ is chosen for
convenience; for example, we might have $d = 64 = 2^6$ on a binary computer,
so that $Y_n$ represents the six most significant bits of the binary representation
of $U_n$. The value of $d$ should be large enough so that the test is meaningful, but
not so large that the test becomes impracticably difficult to carry out.

The quantities $U_n$, $Y_n$, and $d$ will have the above significance throughout
this section, although the value of $d$ will probably be different in different tests.

A. Equidistribution test (Frequency test). The first requirement that sequence
(1) must meet is that its numbers are, in fact, uniformly distributed between
zero and one. There are two ways to make this test: (a) Use the Kolmogorov-Smirnov
test, with $F(x) = x$ for $0 \le x \le 1$. (b) Let $d$ be a convenient number,
e.g., 100 on a decimal computer, 64 or 128 on a binary computer, and use the
sequence (2) instead of (1). For each integer $r$, $0 \le r < d$, count the number
of times that $Y_j = r$ for $0 \le j < n$, and then apply the chi-square test using
$k = d$ and probability $p_s = 1/d$ for each category.

The theory behind this test has been covered in Section 3.3.1. 
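Method (b) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `equidistribution_statistic` is hypothetical, and the input is taken to be a list of floats in the range $0 \le U_j < 1$. The sketch forms $Y_j = \lfloor dU_j \rfloor$, counts each value, and returns the chi-square statistic $V$ for $k = d$ categories with $p_s = 1/d$.

```python
from collections import Counter

def equidistribution_statistic(us, d):
    """Chi-square statistic V for Y_j = floor(d * U_j) against p_s = 1/d.

    Hypothetical helper; assumes every u satisfies 0 <= u < 1.
    """
    n = len(us)
    counts = Counter(int(d * u) for u in us)   # Y_j = floor(d * U_j)
    expected = n / d                           # n * p_s for every category
    return sum((counts.get(r, 0) - expected) ** 2 / expected for r in range(d))
```

The value returned would then be looked up in the chi-square table of Section 3.3.1 with $d - 1$ degrees of freedom.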






The equanimity of your average tosser of coins 
depends upon a law . . . which ensures that 
he will not upset himself by losing too much 
nor upset his opponent by winning too often. 

—TOM STOPPARD, Rosencrantz & Guildenstern are Dead (1966)


B. Serial test. More generally, we want pairs of successive numbers to be 
uniformly distributed in an independent manner. The sun comes up just about 
as often as it goes down, in the long run, but this doesn’t make its motion 
random. 

To carry out the serial test, we simply count the number of times that the
pair $(Y_{2j}, Y_{2j+1}) = (q, r)$ occurs, for $0 \le j < n$; these counts are to be made for
each pair of integers $(q, r)$ with $0 \le q, r < d$, and the chi-square test is applied
to these $k = d^2$ categories with probability $1/d^2$ in each category. As with the
equidistribution test, $d$ may be any convenient number, but it will be somewhat
smaller than the values suggested above since a valid chi-square test should have
$n$ large compared to $k$ (say $n \ge 5d^2$ at least).
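The counting step of the serial test might look as follows in Python. This is a minimal sketch, not a definitive implementation: the name `serial_test_statistic` is hypothetical, and the input `ys` is assumed to be a list of integers with $0 \le Y_j < d$. Only the non-overlapping pairs $(Y_{2j}, Y_{2j+1})$ are used.

```python
def serial_test_statistic(ys, d):
    """Chi-square statistic V over the d*d categories (q, r).

    Hypothetical helper; uses the non-overlapping pairs (Y_2j, Y_2j+1),
    so 2n numbers yield n observations.
    """
    n = len(ys) // 2
    counts = [0] * (d * d)
    for j in range(n):
        q, r = ys[2 * j], ys[2 * j + 1]
        counts[q * d + r] += 1          # category index for the pair (q, r)
    expected = n / (d * d)              # n * (1/d^2)
    return sum((c - expected) ** 2 / expected for c in counts)
```

A caller would normally check that $n \ge 5d^2$ before trusting the result, in line with the rule of thumb stated above.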

Clearly we can generalize this test to triples, quadruples, etc., instead of 
pairs (see exercise 2); however, the value of d must then be severely reduced in 
order to avoid having too many categories. When quadruples and larger numbers 
of adjacent elements are considered, we therefore make use of less exact tests 
such as the poker test or the maximum test described below. 

Note that $2n$ numbers of the sequence (2) are used in this test in order
to make $n$ observations. It would be a mistake to perform the serial test on
the pairs $(Y_0, Y_1), (Y_1, Y_2), \dots, (Y_{n-1}, Y_n)$; can the reader see why? We might
perform another serial test on the pairs $(Y_{2j+1}, Y_{2j+2})$, and expect the sequence
to pass both tests. Alternatively, I. J. Good has proved that if $d$ is prime, and
if the pairs $(Y_0, Y_1), (Y_1, Y_2), \dots, (Y_{n-1}, Y_n)$ are used, and if we use the usual
chi-square method to compute both the statistics $V_2$ for the serial test and $V_1$
for the frequency test on $Y_0, \dots, Y_{n-1}$ with the same value of $d$, then $V_2 - 2V_1$
should have the chi-square distribution with $(d-1)^2$ degrees of freedom when
$n$ is large. (See Proc. Cambridge Phil. Soc. 49 (1953), 276-284; Annals Math.
Stat. 28 (1957), 262-264.)

C. Gap test. Another test is used to examine the length of “gaps” between
occurrences of $U_j$ in a certain range. If $\alpha$ and $\beta$ are two real numbers with
$0 \le \alpha < \beta \le 1$, we want to consider the lengths of consecutive subsequences
$U_j, U_{j+1}, \dots, U_{j+r}$ in which $U_{j+r}$ lies between $\alpha$ and $\beta$ but the other $U$'s do
not. (This subsequence of $r + 1$ numbers represents a gap of length $r$.)

Algorithm G (Data for gap test). The following algorithm, applied to the sequence
(1) for any given values of $\alpha$ and $\beta$, counts the number of gaps of lengths 0, 1, ...,
$t - 1$ and the number of gaps of length $\ge t$, until $n$ gaps have been tabulated.

Fig. 6. Gathering data for the gap test. (Algorithms for the “coupon-collector’s test”
and the “run test” are similar.)

G1. [Initialize.] Set $j \leftarrow -1$, $s \leftarrow 0$, and set COUNT[r] $\leftarrow 0$ for $0 \le r \le t$.

G2. [Set r zero.] Set $r \leftarrow 0$.

G3. [$\alpha \le U_j < \beta$?] Increase $j$ by 1. If $U_j \ge \alpha$ and $U_j < \beta$, go to step G5.

G4. [Increase r.] Increase $r$ by one, and return to step G3.

G5. [Record gap length.] (A gap of length $r$ has now been found.) If $r \ge t$,
increase COUNT[t] by one, otherwise increase COUNT[r] by one.

G6. [n gaps found?] Increase $s$ by one. If $s < n$, return to step G2. ∎

After this algorithm has been performed, the chi-square test is applied to
the $k = t + 1$ values of COUNT[0], COUNT[1], ..., COUNT[t], using the following
probabilities:

$$p_0 = p, \quad p_1 = p(1-p), \quad p_2 = p(1-p)^2, \quad \dots, \quad p_{t-1} = p(1-p)^{t-1}, \quad p_t = (1-p)^t. \tag{4}$$

Here $p = \beta - \alpha$, the probability that $\alpha \le U_j < \beta$. The values of $n$ and $t$ are
to be chosen, as usual, so that each of the values of COUNT[r] is expected to be 5
or more, preferably more.
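Algorithm G and the probabilities (4) translate almost directly into Python. The sketch below is an illustration only: the names `gap_test_counts` and `gap_probabilities` are hypothetical, and the input is assumed to be a finite iterable of floats, so the code raises an error where the algorithm in the text would simply keep reading the sequence.

```python
def gap_test_counts(us, alpha, beta, t, n):
    """Algorithm G: tabulate n gaps; count[r] for r < t, count[t] for gaps >= t."""
    count = [0] * (t + 1)                    # G1
    it = iter(us)
    for _ in range(n):                       # G6: stop once n gaps are found
        r = 0                                # G2
        for u in it:                         # G3: look at the next U_j
            if alpha <= u < beta:
                break                        # gap of length r is complete
            r += 1                           # G4
        else:
            raise ValueError("sequence exhausted before n gaps were found")
        count[min(r, t)] += 1                # G5
    return count

def gap_probabilities(alpha, beta, t):
    """Probabilities p_0, ..., p_t of Eq. (4), with p = beta - alpha."""
    p = beta - alpha
    return [p * (1 - p) ** r for r in range(t)] + [(1 - p) ** t]
```

The chi-square statistic would then compare `count[r]` with `n * gap_probabilities(alpha, beta, t)[r]` over the $t + 1$ categories.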

The gap test is often applied with $\alpha = 0$ or $\beta = 1$ in order to omit one of
the comparisons in step G3. The special cases $(\alpha, \beta) = (0, \frac{1}{2})$ or $(\frac{1}{2}, 1)$ give rise
to tests that are sometimes called “runs above the mean” and “runs below the
mean,” respectively.

The probabilities in Eq. (4) are easily deduced, so this derivation is left to
the reader. Note that the gap test as described above observes the lengths of $n$
gaps; it does not observe the gap lengths among $n$ numbers. If the sequence $\langle U_n \rangle$
is sufficiently nonrandom, Algorithm G may not terminate. Other gap tests that
examine a fixed number of $U$'s have also been proposed (see exercise 5).

D. Poker test (Partition test). The “classical” poker test considers $n$ groups of
five successive integers, $(Y_{5j}, Y_{5j+1}, \dots, Y_{5j+4})$ for $0 \le j < n$, and observes
which of the following seven patterns is matched by each quintuple:

All different:    abcde        Full house:      aaabb
One pair:         aabcd        Four of a kind:  aaaab
Two pairs:        aabbc        Five of a kind:  aaaaa
Three of a kind:  aaabc


A chi-square test is based on the number of quintuples in each category. 

It is reasonable to ask for a somewhat simpler version of this test, to facilitate 
the programming involved. A good compromise would simply be to count the 
number of distinct values in the set of five. We would then have five categories: 

5 different = all different; 

4 different = one pair; 

3 different = two pairs, or three of a kind; 

2 different = full house, or four of a kind; 

1 different = five of a kind. 

This breakdown is easier to determine systematically, and the test is nearly as 
good. 

In general we can consider $n$ groups of $k$ successive numbers, and we can
count the number of $k$-tuples with $r$ different values. A chi-square test is then
made, using the probability

$$p_r = \frac{d(d-1)\dots(d-r+1)}{d^k} {k \brace r} \tag{5}$$

that there are $r$ different. (The Stirling numbers ${k \brace r}$ are defined in Section 1.2.6,
and they can readily be computed using the formulas given there.) Since the
probability $p_r$ is very small when $r = 1$ or 2, we generally lump a few categories
of low probability together before the chi-square test is applied.

To derive the proper formula for $p_r$, we must count how many of the $d^k$
$k$-tuples of numbers between 0 and $d - 1$ have exactly $r$ different elements, and
divide the total by $d^k$. Since $d(d-1)\dots(d-r+1)$ is the number of ordered
choices of $r$ things from a set of $d$ objects, we need only show that ${k \brace r}$ is the
number of ways to partition a set of $k$ elements into exactly $r$ parts. Therefore
exercise 1.2.6-64 completes the derivation of Eq. (5).
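The simplified poker test and the probabilities of Eq. (5) can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, the Stirling numbers of the second kind are computed with the standard recurrence ${k \brace r} = r{k-1 \brace r} + {k-1 \brace r-1}$, and `math.perm` (Python 3.8 or later) supplies the falling product $d(d-1)\dots(d-r+1)$.

```python
from math import perm

def stirling2(k, r):
    """Stirling number of the second kind via the usual recurrence."""
    table = [[0] * (r + 1) for _ in range(k + 1)]
    table[0][0] = 1
    for i in range(1, k + 1):
        for j in range(1, min(i, r) + 1):
            table[i][j] = j * table[i - 1][j] + table[i - 1][j - 1]
    return table[k][r]

def poker_probability(d, k, r):
    """Eq. (5): probability that a k-tuple of values in 0..d-1 has r distinct values."""
    return perm(d, r) * stirling2(k, r) / d ** k

def distinct_value_counts(ys, k):
    """counts[r] = number of consecutive k-tuples of ys with exactly r distinct values."""
    counts = [0] * (k + 1)
    for j in range(len(ys) // k):
        counts[len(set(ys[j * k:(j + 1) * k]))] += 1
    return counts
```

For $d = 10$ and $k = 5$ the five probabilities add up to 1, which is a convenient sanity check before lumping the rare categories together.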

E. Coupon collector’s test. This test is related to the poker test somewhat as
the gap test is related to the frequency test. The sequence $Y_0, Y_1, \dots$ is used,
and we observe the lengths of segments $Y_{j+1}, Y_{j+2}, \dots, Y_{j+r}$ required to get a
“complete set” of integers from 0 to $d - 1$. Algorithm C describes this precisely:

Algorithm C (Data for coupon collector’s test). Given a sequence of integers $Y_0$,
$Y_1, \dots$, with $0 \le Y_j < d$, this algorithm counts the lengths of $n$ consecutive
“coupon collector” segments. At the conclusion of the algorithm, COUNT[r] is the
number of segments with length $r$, for $d \le r < t$, and COUNT[t] is the number of
segments with length $\ge t$.

C1. [Initialize.] Set $j \leftarrow -1$, $s \leftarrow 0$, and set COUNT[r] $\leftarrow 0$ for $d \le r \le t$.

C2. [Set q, r zero.] Set $q \leftarrow r \leftarrow 0$, and set OCCURS[k] $\leftarrow 0$ for $0 \le k < d$.

C3. [Next observation.] Increase $r$ and $j$ by 1. If OCCURS[$Y_j$] $\ne 0$, repeat this
step.

C4. [Complete set?] Set OCCURS[$Y_j$] $\leftarrow 1$ and $q \leftarrow q + 1$. (The subsequence
observed so far contains $q$ distinct values; if $q = d$, we therefore have a
complete set.) If $q < d$, return to step C3.

C5. [Record the length.] If $r \ge t$, increase COUNT[t] by one, otherwise increase
COUNT[r] by one.

C6. [n found?] Increase $s$ by one. If $s < n$, return to step C2. ∎
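Algorithm C can be sketched in Python as shown below. The function name `coupon_collector_counts` is hypothetical, and the input is assumed to be a finite iterable of integers with $0 \le Y_j < d$; the sketch raises an error if the data run out before $n$ complete sets have been seen.

```python
def coupon_collector_counts(ys, d, t, n):
    """Algorithm C: count[r] for d <= r < t, count[t] for segments of length >= t."""
    count = [0] * (t + 1)                 # C1 (indices below d stay zero)
    it = iter(ys)
    for _ in range(n):                    # C6: n segments in all
        occurs = [False] * d              # C2
        q = r = 0
        while q < d:                      # C3 and C4
            y = next(it, None)
            if y is None:
                raise ValueError("sequence exhausted before n segments were found")
            r += 1
            if not occurs[y]:
                occurs[y] = True
                q += 1
        count[min(r, t)] += 1             # C5
    return count
```

For instance, with $d = 3$ the sequence 1, 1, 0, 1, 2, 2, 1, 0, 2, 2, 1, 2, 0 splits into segments of lengths 5, 3, and 5.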

For an example of this algorithm, see exercise 7. We may think of a boy
collecting $d$ types of coupons, which are randomly distributed in his breakfast
cereal boxes; he must keep eating more cereal until he has one coupon of each
type.

A chi-square test is to be applied to COUNT[d], COUNT[d + 1], ..., COUNT[t],
with $k = t - d + 1$, after Algorithm C has counted $n$ lengths. The corresponding
probabilities are

$$p_r = \frac{d!}{d^r} {r-1 \brace d-1}, \quad d \le r < t; \qquad p_t = 1 - \frac{d!}{d^{t-1}} {t-1 \brace d}. \tag{6}$$

To derive these probabilities, we simply note that if $q_r$ denotes the probability
that a subsequence of length $r$ is incomplete, then

$$q_r = 1 - \frac{d!}{d^r} {r \brace d}$$

by Eq. (5); for this means we have an $r$-tuple of elements that do not have
all $d$ different values. Then (6) follows from the relations $p_r = q_{r-1} - q_r$ for
$d \le r < t$; $p_t = q_{t-1}$.
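The probabilities for the coupon collector's test can be computed from the relations $p_r = q_{r-1} - q_r$ and $p_t = q_{t-1}$, which reduce to Stirling numbers of the second kind. The sketch below is an illustration only; `stirling2` and `coupon_probabilities` are hypothetical names, and the closed forms used are $p_r = (d!/d^r){r-1 \brace d-1}$ for $d \le r < t$ and $p_t = 1 - (d!/d^{t-1}){t-1 \brace d}$.

```python
from math import factorial

def stirling2(k, r):
    """Stirling number of the second kind via the usual recurrence."""
    table = [[0] * (r + 1) for _ in range(k + 1)]
    table[0][0] = 1
    for i in range(1, k + 1):
        for j in range(1, min(i, r) + 1):
            table[i][j] = j * table[i - 1][j] + table[i - 1][j - 1]
    return table[k][r]

def coupon_probabilities(d, t):
    """Return {r: p_r} for d <= r <= t, where p_t covers all lengths >= t."""
    probs = {r: factorial(d) * stirling2(r - 1, d - 1) / d ** r
             for r in range(d, t)}
    probs[t] = 1 - factorial(d) * stirling2(t - 1, d) / d ** (t - 1)
    return probs
```

With $d = 2$ and $t = 4$ this gives 1/2, 1/4, and 1/4, matching a direct count: the second coupon differs from the first with probability 1/2, and so on.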

For formulas that arise in connection with generalizations of the coupon
collector’s test, see exercises 9 and 10 and also the paper by Hermann von
Schelling, AMM 61 (1954), 306-311.

F. Permutation test. Divide the input sequence into $n$ groups of $t$ elements each,
that is, $(U_{jt}, U_{jt+1}, \dots, U_{jt+t-1})$ for $0 \le j < n$. The elements in each group
can have $t!$ possible relative orderings; the number of times each ordering appears
is counted, and a chi-square test is applied with $k = t!$ and with probability $1/t!$
for each ordering.

For example, if $t = 3$ we would have six possible categories, according to
whether $U_{3j} < U_{3j+1} < U_{3j+2}$ or $U_{3j} < U_{3j+2} < U_{3j+1}$ or $\dots$ or
$U_{3j+2} < U_{3j+1} < U_{3j}$. We assume in this test that equality between $U$'s does not occur;
such an assumption is justified, for the probability that two $U$'s are equal is zero.

A convenient way to perform the permutation test on a computer makes use 
of the following algorithm, which is of interest in itself: 

Algorithm P (Analyze a permutation). Given a sequence of distinct elements
$(U_1, \dots, U_t)$, we compute an integer $f(U_1, \dots, U_t)$ such that

$$0 \le f(U_1, \dots, U_t) < t!,$$

and $f(U_1, \dots, U_t) = f(V_1, \dots, V_t)$ if and only if $(U_1, \dots, U_t)$ and $(V_1, \dots, V_t)$
have the same relative ordering.

P1. [Initialize.] Set $r \leftarrow t$, $f \leftarrow 0$. (During this algorithm we will have
$0 \le f < t!/r!$.)

P2. [Find maximum.] Find the maximum of $\{U_1, \dots, U_r\}$, and suppose that $U_s$
is the maximum. Set $f \leftarrow r \cdot f + s - 1$.

P3. [Exchange.] Exchange $U_r \leftrightarrow U_s$.

P4. [Decrease r.] Decrease $r$ by one. If $r > 1$, return to step P2. ∎

Note that the sequence $(U_1, \dots, U_t)$ will have been sorted into ascending order
when this algorithm stops. To prove that the result $f$ uniquely characterizes
the initial order of $(U_1, \dots, U_t)$, we note that Algorithm P can be run backwards:
For $r = 2, 3, \dots, t$, set $s \leftarrow f \bmod r$, $f \leftarrow \lfloor f/r \rfloor$, and exchange $U_r \leftrightarrow U_{s+1}$.
(The index is $s + 1$ because step P2 added $s - 1$, not $s$, to $f$.) It is easy
to see that this will undo the effects of steps P2-P4; hence no two permutations
can yield the same value of $f$, and Algorithm P performs as advertised.
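A Python rendering of Algorithm P is sketched below. The names `permutation_index` and `all_indices` are hypothetical; the code uses 0-based positions, so the quantity $s - 1$ of step P2 appears simply as `s`, and it works on a copy of the input rather than sorting the caller's list in place.

```python
from itertools import permutations

def permutation_index(us):
    """Algorithm P: map distinct elements to an integer f with 0 <= f < t!."""
    u = list(us)                                  # copy; ends up in ascending order
    f = 0                                         # P1
    for r in range(len(u), 1, -1):                # r = t, t-1, ..., 2
        s = max(range(r), key=u.__getitem__)      # P2: 0-based position of the maximum
        f = r * f + s                             # text's s - 1 equals this 0-based s
        u[r - 1], u[s] = u[s], u[r - 1]           # P3
    return f

def all_indices(t):
    """Sorted list of f over every ordering of t distinct values (for checking)."""
    return sorted(permutation_index(p) for p in permutations(range(t)))
```

Running `all_indices(t)` for small $t$ shows each value $0, 1, \dots, t! - 1$ exactly once, which is the uniqueness property argued above.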

The essential idea that underlies Algorithm P is a mixed-radix representation
called the “factorial number system”: Every integer in the range $0 \le f < t!$ can
be uniquely written in the form

$$\begin{aligned} f &= (\dots(c_{t-1} \times (t-1) + c_{t-2}) \times (t-2) + \dots + c_2) \times 2 + c_1 \\ &= (t-1)!\,c_{t-1} + (t-2)!\,c_{t-2} + \dots + 2!\,c_2 + 1!\,c_1 \end{aligned} \tag{7}$$

where the “digits” $c_j$ are integers satisfying

$$0 \le c_j \le j, \quad \text{for } 1 \le j < t. \tag{8}$$

In Algorithm P, $c_{r-1} = s - 1$ when step P2 is performed for a given value of $r$.

G. Run test. A sequence may also be tested for “runs up” and “runs down.”
This means we examine the length of monotone subsequences of the original
sequence, i.e., segments that are increasing or decreasing.

As an example of the precise definition of a run, consider the sequence of ten
numbers “1298536704”; putting a vertical line at the left and right and between
$X_j$ and $X_{j+1}$ whenever $X_j > X_{j+1}$, we obtain

$$|\,1\;2\;9\,|\,8\,|\,5\,|\,3\;6\;7\,|\,0\;4\,| \tag{9}$$

which displays the “runs up”: there is a run of length 3, followed by two runs
of length 1, followed by another run of length 3, followed by a run of length 2.
The algorithm of exercise 12 shows how to tabulate the length of “runs up.”

Unlike the gap test and the coupon collector’s test (which are in many 
other respects similar to this test), we should not apply a chi-square test to the 
above data, since adjacent runs are not independent. A long run will tend to be 
followed by a short run, and conversely. This lack of independence is enough to 
invalidate a straightforward chi-square test. Instead, the following statistic may 
be computed, when the run lengths have been determined as in exercise 12: 

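$$V = \frac{1}{n} \sum_{1 \le i, j \le 6} (\mathrm{COUNT}[i] - nb_i)(\mathrm{COUNT}[j] - nb_j)\,a_{ij}, \tag{10}$$

where $n$ is the length of the sequence, and the coefficients $a_{ij}$ and $b_i$ are equal
to

$$\begin{pmatrix} a_{11} & a_{12} & a_{13} & a_{14} & a_{15} & a_{16} \\ a_{21} & a_{22} & a_{23} & a_{24} & a_{25} & a_{26} \\ a_{31} & a_{32} & a_{33} & a_{34} & a_{35} & a_{36} \\ a_{41} & a_{42} & a_{43} & a_{44} & a_{45} & a_{46} \\ a_{51} & a_{52} & a_{53} & a_{54} & a_{55} & a_{56} \\ a_{61} & a_{62} & a_{63} & a_{64} & a_{65} & a_{66} \end{pmatrix} = \begin{pmatrix} 4529.4 & 9044.9 & 13568 & 18091 & 22615 & 27892 \\ 9044.9 & 18097 & 27139 & 36187 & 45234 & 55789 \\ 13568 & 27139 & 40721 & 54281 & 67852 & 83685 \\ 18091 & 36187 & 54281 & 72414 & 90470 & 111580 \\ 22615 & 45234 & 67852 & 90470 & 113262 & 139476 \\ 27892 & 55789 & 83685 & 111580 & 139476 & 172860 \end{pmatrix} \tag{11}$$

$$(b_1\;\; b_2\;\; b_3\;\; b_4\;\; b_5\;\; b_6) = \left(\tfrac{1}{6}\;\; \tfrac{5}{24}\;\; \tfrac{11}{120}\;\; \tfrac{19}{720}\;\; \tfrac{29}{5040}\;\; \tfrac{1}{840}\right).$$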

(The values of $a_{ij}$ shown here are approximate only; exact values may be obtained
by using formulas derived below.) The statistic $V$ in (10) should have the chi-square
distribution with six (not five) degrees of freedom, when $n$ is large. The
value of $n$ should be, say, 4000 or more. The same test can be applied to “runs
down.”
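The statistic of Eq. (10) can be sketched in Python as follows. Several things here are assumptions rather than the text's own method: the function names are hypothetical; the run counter is a simple stand-in for the algorithm of exercise 12, which is not reproduced in this section, and it counts the final (possibly truncated) run like any other; the coefficients $b_i$ are taken as $\frac{1}{6}, \frac{5}{24}, \frac{11}{120}, \frac{19}{720}, \frac{29}{5040}, \frac{1}{840}$; and the matrix entries are the approximate values of (11).

```python
A = [
    [4529.4, 9044.9, 13568.0, 18091.0, 22615.0, 27892.0],
    [9044.9, 18097.0, 27139.0, 36187.0, 45234.0, 55789.0],
    [13568.0, 27139.0, 40721.0, 54281.0, 67852.0, 83685.0],
    [18091.0, 36187.0, 54281.0, 72414.0, 90470.0, 111580.0],
    [22615.0, 45234.0, 67852.0, 90470.0, 113262.0, 139476.0],
    [27892.0, 55789.0, 83685.0, 111580.0, 139476.0, 172860.0],
]
B = [1 / 6, 5 / 24, 11 / 120, 19 / 720, 29 / 5040, 1 / 840]

def runs_up_counts(xs):
    """count[p] = number of ascending runs of length p (1..5); count[6] = length >= 6.

    Assumes xs is a nonempty list or tuple; index 0 is unused.
    """
    count = [0] * 7
    r = 1
    for prev, cur in zip(xs, xs[1:]):
        if prev < cur:
            r += 1                     # the current run continues
        else:
            count[min(r, 6)] += 1      # the run ended just before cur
            r = 1
    count[min(r, 6)] += 1              # record the final run
    return count

def run_test_statistic(xs):
    """Eq. (10): quadratic form in the deviations COUNT[i] - n*b_i."""
    n = len(xs)
    count = runs_up_counts(xs)
    dev = [count[i + 1] - n * B[i] for i in range(6)]
    return sum(dev[i] * dev[j] * A[i][j] for i in range(6) for j in range(6)) / n
```

Applied to the ten digits of example (9), `runs_up_counts` reports two runs of length 1, one of length 2, and two of length 3; the statistic itself is meaningful only for much longer sequences.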

A vastly simpler and more practical run test appears in exercise 14, so
a reader who is interested only in testing random number generators should
skip the next few pages and go on to the “maximum-of-t test” after looking at
exercise 14. On the other hand it is instructive from a mathematical standpoint
to see how a complicated run test with interdependent runs can be treated, so
we shall now digress for a moment.

Given any permutation on $n$ elements, let $Z_{pi} = 1$ if position $i$ is the
beginning of an ascending run of length $p$ or more, and let $Z_{pi} = 0$ otherwise.
For example, consider the permutation (9) with $n = 10$; we have

$$Z_{11} = Z_{21} = Z_{31} = Z_{14} = Z_{15} = Z_{16} = Z_{26} = Z_{36} = Z_{19} = Z_{29} = 1,$$

and all other $Z$'s are zero. With this notation,

$$R'_p = Z_{p1} + Z_{p2} + \dots + Z_{pn} \tag{12}$$

is the number of runs of length $\ge p$, and

$$R_p = R'_p - R'_{p+1} \tag{13}$$

is the number of runs of length $p$ exactly. Our goal is to compute the mean value
of $R_p$, and also the covariance

$$\mathrm{covar}(R_p, R_q) = \mathrm{mean}\bigl((R_p - \mathrm{mean}(R_p))(R_q - \mathrm{mean}(R_q))\bigr),$$

which measures the interdependence of $R_p$ and $R_q$. These mean values are to
be computed as the average over the set of all $n!$ permutations.

Equations (12) and (13) show that the answers can be expressed in terms of
the mean values of $Z_{pi}$ and of $Z_{pi}Z_{qj}$, so as the first step of the derivation we
obtain the following results (assuming that $i < j$):

$$\begin{aligned} \frac{1}{n!}\sum Z_{pi} &= \begin{cases} (p + \delta_{i1})/(p+1)!, & \text{if } i \le n - p + 1; \\ 0, & \text{otherwise.} \end{cases} \\[1ex] \frac{1}{n!}\sum Z_{pi}Z_{qj} &= \begin{cases} (p + \delta_{i1})\,q/(p+1)!\,(q+1)!, & \text{if } i + p < j \le n - q + 1; \\ (p + \delta_{i1})/(p+1)!\,q! - (p + q + \delta_{i1})/(p+q+1)!, & \text{if } i + p = j \le n - q + 1; \\ 0, & \text{otherwise.} \end{cases} \end{aligned} \tag{14}$$

The $\sum$-signs stand for a summation over all possible permutations. To illustrate
the calculations involved here, we will work the most difficult case, when
$i + p = j \le n - q + 1$, and when $i > 1$. Note that $Z_{pi}Z_{qj}$ is either zero or one, so the
summation consists of counting all permutations $U_1 U_2 \dots U_n$ for which
$Z_{pi} = Z_{qj} = 1$, that is, all permutations such that

$$U_{i-1} > U_i < \dots < U_{i+p-1} > U_{i+p} < \dots < U_{i+p+q-1}. \tag{15}$$

The number of such permutations may be enumerated as follows: there are
$\binom{n}{p+q+1}$ ways to choose the elements for the positions indicated in (15); there
are

$$(p+q+1)\binom{p+q}{p} - \binom{p+q+1}{p+1} - \binom{p+q+1}{1} + 1 \tag{16}$$

ways to arrange them in the order (15), as shown in exercise 13; and there
are $(n-p-q-1)!$ ways to arrange the remaining elements. Thus there are
$\binom{n}{p+q+1}(n-p-q-1)!$ times (16) ways in all, and dividing by $n!$ we get the
desired formula.

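From relations (14) a rather lengthy calculation leads to

$$\begin{aligned} \mathrm{mean}(R'_p) &= \mathrm{mean}(Z_{p1} + \dots + Z_{pn}) \\ &= (n+1)p/(p+1)! - (p-1)/p!, \qquad 1 \le p \le n; \end{aligned} \tag{17}$$

$$\begin{aligned} \mathrm{covar}(R'_p, R'_q) &= \mathrm{mean}(R'_p R'_q) - \mathrm{mean}(R'_p)\,\mathrm{mean}(R'_q) \\ &= \sum_{1 \le i, j \le n} \frac{1}{n!} \sum Z_{pi}Z_{qj} - \mathrm{mean}(R'_p)\,\mathrm{mean}(R'_q) \\ &= \begin{cases} \mathrm{mean}(R'_t) + f(p, q, n), & \text{if } p + q \le n, \\ \mathrm{mean}(R'_t) - \mathrm{mean}(R'_p)\,\mathrm{mean}(R'_q), & \text{if } p + q > n, \end{cases} \end{aligned} \tag{18}$$

where $t = \max(p, q)$, $s = p + q$, and

$$f(p, q, n) = (n+1)\left(\frac{s(1 - pq) + pq}{(p+1)!\,(q+1)!} - \frac{2s}{(s+1)!}\right) + 2\left(\frac{s-1}{s!}\right) + \frac{(s^2 - s - 2)pq - s^2 - p^2q^2 + 1}{(p+1)!\,(q+1)!}. \tag{19}$$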

This expression for the covariance is unfortunately quite complicated, but it is
necessary for a successful run test as described above. From these formulas it is
easy to compute

$$\begin{aligned} \mathrm{mean}(R_p) &= \mathrm{mean}(R'_p) - \mathrm{mean}(R'_{p+1}), \\ \mathrm{covar}(R_p, R'_q) &= \mathrm{covar}(R'_p, R'_q) - \mathrm{covar}(R'_{p+1}, R'_q), \\ \mathrm{covar}(R_p, R_q) &= \mathrm{covar}(R_p, R'_q) - \mathrm{covar}(R_p, R'_{q+1}). \end{aligned} \tag{20}$$
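The closed form for the mean number of runs of length $\ge p$, namely $(n+1)p/(p+1)! - (p-1)/p!$, can be checked by brute force for small $n$. The sketch below is only a sanity check, not part of the derivation; the function names are hypothetical, and exact rational arithmetic is used so that the comparison involves no rounding.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def runs_at_least(perm, p):
    """Number of ascending runs of length >= p in the sequence perm."""
    count, r = 0, 1
    for a, b in zip(perm, perm[1:]):
        if a < b:
            r += 1
        else:
            count += r >= p            # a run of length r just ended
            r = 1
    return count + (r >= p)            # include the final run

def mean_runs_at_least(n, p):
    """Average of runs_at_least over all n! permutations, as an exact fraction."""
    total = sum(runs_at_least(perm, p) for perm in permutations(range(n)))
    return Fraction(total, factorial(n))

def closed_form_mean(n, p):
    """(n + 1) p / (p + 1)! - (p - 1) / p!, valid for 1 <= p <= n."""
    return Fraction((n + 1) * p, factorial(p + 1)) - Fraction(p - 1, factorial(p))
```

For $n = 3$ and $p = 2$, for example, five of the six permutations contain exactly one run of length two or more, and both functions give $\frac{5}{6}$.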

In Annals Math. Stat. 15 (1944), 163-165, J. Wolfowitz proved that the quantities
$R_1, R_2, \dots, R_{t-1}, R'_t$ become normally distributed as $n \to \infty$, subject to the
mean and covariance expressed above; this implies that the following test for
runs is valid: Given a sequence of $n$ random numbers, compute the number of
runs $R_p$ of length $p$ for $1 \le p < t$, and also the number of runs $R'_t$ of length
$t$ or more. Let

$$Q_1 = R_1 - \mathrm{mean}(R_1), \quad \dots, \quad Q_{t-1} = R_{t-1} - \mathrm{mean}(R_{t-1}), \quad Q_t = R'_t - \mathrm{mean}(R'_t). \tag{21}$$

Form the matrix $C$ of the covariances of the $R$'s; for example, $C_{13} =
\mathrm{covar}(R_1, R_3)$, while $C_{1t} = \mathrm{covar}(R_1, R'_t)$. When $t = 6$, we have

$$C = nC_1 + C_2 \tag{22}$$

if $n \ge 12$. Now form $A = (a_{ij})$, the inverse of the matrix $C$, and compute
$\sum_{1 \le i, j \le t} Q_i Q_j a_{ij}$. The result for large $n$ should have approximately the chi-square
distribution with $t$ degrees of freedom.

The matrix (11) given earlier is the inverse of $C_1$ to five significant figures.
When $n$ is large, $A$ will be approximately $(1/n)C_1^{-1}$; a test with $n = 100$ showed
that the entries $a_{ij}$ in (11) were each about 1 percent lower than the true values
obtained by inverting (22).

H. Maximum-of-t test. For $0 \le j < n$, let $V_j = \max(U_{tj}, U_{tj+1}, \dots, U_{tj+t-1})$.
Now apply the Kolmogorov-Smirnov test to the sequence $V_0, V_1, \dots, V_{n-1}$,
with the distribution function $F(x) = x^t$, $0 \le x \le 1$. Alternatively, apply the
equidistribution test to the sequence $V_0^t, V_1^t, \dots, V_{n-1}^t$.

To verify this test, we must show that the distribution function for the $V_j$ is
$F(x) = x^t$. The probability that $\max(U_1, U_2, \dots, U_t) \le x$ is the probability that
$U_1 \le x$ and $U_2 \le x$ and ... and $U_t \le x$, which is the product of the individual
probabilities, namely $x \cdot x \cdots x = x^t$.
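The maximum-of-t test combined with the KS statistics can be sketched as below. The name `maximum_of_t_ks` is hypothetical, and the KS formulas are assumed to be those of Section 3.3.1 for sorted observations $X_1 \le \dots \le X_n$: $K_n^+ = \sqrt{n}\max_j (j/n - F(X_j))$ and $K_n^- = \sqrt{n}\max_j (F(X_j) - (j-1)/n)$.

```python
from math import sqrt

def maximum_of_t_ks(us, t):
    """Return (K+, K-) for V_j = max of each block of t values, with F(x) = x**t."""
    n = len(us) // t
    vs = sorted(max(us[j * t:(j + 1) * t]) for j in range(n))
    fs = [v ** t for v in vs]                         # F(V_j) = V_j^t
    # enumerate() is 0-based, so rank j in the formulas is index + 1.
    k_plus = sqrt(n) * max((i + 1) / n - f for i, f in enumerate(fs))
    k_minus = sqrt(n) * max(f - i / n for i, f in enumerate(fs))
    return k_plus, k_minus
```

The two values would then be compared with the percentage points of the KS distribution for $n$ observations.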

I. Collision test. Chi-square tests can be made only when there is a nontrivial 
number of items expected in each category. But there is another kind of test 
that can be used when the number of categories is much larger than the number 
of observations; this test is related to “hashing,” an important method for 
information retrieval that we shall study in Chapter 6.






Suppose we have m urns and we throw n balls at random into those urns, 
where m is much greater than n. Most of the balls will land in urns that were 
previously empty, but if a ball falls into an urn that already contains at least one 
ball we say that a “collision” has occurred. The collision test counts the number 
of collisions, and a generator passes this test if it doesn’t induce too many or too 
few collisions. 

To fix the ideas, suppose $m = 2^{20}$ and $n = 2^{14}$. Then each urn will receive
only one 64th of a ball, on the average. The probability that a given urn will
contain exactly $k$ balls is $p_k = \binom{n}{k} m^{-k}(1 - m^{-1})^{n-k}$, so the expected number
of collisions per urn is $\sum_{k \ge 1}(k-1)p_k = \sum_{k \ge 0} kp_k - \sum_{k \ge 1} p_k = n/m - 1 + p_0$.
Since $p_0 = (1 - m^{-1})^n = 1 - n/m + \binom{n}{2} m^{-2} +$ smaller terms, we find that
the average total number of collisions taken over all $m$ urns is very slightly less
than $n^2/2m = 128$.
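The expected number of collisions, $m(n/m - 1 + p_0)$ with $p_0 = (1 - 1/m)^n$, is easy to evaluate numerically. The sketch below uses a hypothetical function name and relies on `math.log1p` and `math.expm1` to avoid the cancellation that would otherwise occur when $m$ is large and $p_0$ is very close to $1 - n/m$.

```python
from math import expm1, log1p

def expected_collisions(m, n):
    """m * (n/m - 1 + p0), where p0 = (1 - 1/m)**n is the chance an urn stays empty."""
    # expm1(x) = e^x - 1, so n/m + expm1(...) equals n/m - 1 + p0 without cancellation.
    return m * (n / m + expm1(n * log1p(-1.0 / m)))
```

For $m = 2^{20}$ and $n = 2^{14}$ the result is a little under 128, in agreement with the estimate $n^2/2m$.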

We can use the collision test to rate a random number generator in a large
number of dimensions. For example, when $m = 2^{20}$ and $n = 2^{14}$ we can test
the 20-dimensional randomness of a number generator by letting $d = 2$ and
forming 20-dimensional vectors $V_j = (Y_{20j}, Y_{20j+1}, \dots, Y_{20j+19})$ for $0 \le j < n$.
It suffices to keep a table of $m = 2^{20}$ bits to determine collisions, one bit for
each possible value of the vector $V_j$; on a computer with 32 bits per word, this
amounts to $2^{15}$ words. Initially all $2^{20}$ bits of this table are cleared to zero; then
for each $V_j$, if the corresponding bit is already 1 we record a collision, otherwise
we set the bit to 1. This test can also be used in 10 dimensions with $d = 4$, and
so on.
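The bit-table bookkeeping can be sketched in Python with a `bytearray`. The function name `count_collisions` and its parameters are hypothetical; each vector of `dims` digits in radix `d` is packed into a single index, and one bit per possible index records whether that vector has been seen.

```python
def count_collisions(ys, d, dims, n):
    """Count collisions among n vectors of dims consecutive values 0 <= y < d.

    Raises StopIteration if ys holds fewer than n * dims values.
    """
    m = d ** dims                          # number of urns
    table = bytearray((m + 7) // 8)        # m bits, all cleared to zero
    it = iter(ys)
    collisions = 0
    for _ in range(n):
        index = 0
        for _ in range(dims):              # pack the vector as a radix-d number
            index = index * d + next(it)
        byte, bit = divmod(index, 8)
        if table[byte] >> bit & 1:
            collisions += 1                # this urn was already occupied
        else:
            table[byte] |= 1 << bit        # mark the urn as occupied
    return collisions
```

With `d = 2` and `dims = 20` the table occupies $2^{17}$ bytes, the same $2^{20}$ bits described above.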

To decide if the test is passed, we can use the following table of percentage
points when $m = 2^{20}$ and $n = 2^{14}$:

collisions $\le$       101    108    119    126    134    145    153
with probability     .009   .043   .244   .476   .742   .946   .989

The theory underlying these probabilities is the same we used in the poker test,
Eq. (5); the probability that $c$ collisions occur is the probability that $n - c$ urns
are occupied, namely

$$\frac{m(m-1)\dots(m-n+c+1)}{m^n} {n \brace n-c}.$$
Although m and n are very large, it is not difficult to compute these probabilities 
using the following method: 

Algorithm S (Percentage points for collision test). Given $m$ and $n$, this algorithm
determines the distribution of the number of collisions that occur when $n$ balls
are scattered into $m$ urns. An auxiliary array A[0], A[1], ..., A[n] of floating
point numbers is used for the computation; actually A[j] will be nonzero only for
$j_0 \le j \le j_1$, and $j_1 - j_0$ will be at most of order $\log n$, so it would be possible
to get by with considerably less storage.

S1. [Initialize.] Set A[j] $\leftarrow 0$ for $0 \le j \le n$; then set A[1] $\leftarrow 1$ and $j_0 \leftarrow j_1 \leftarrow 1$.
Then do step S2 exactly $n - 1$ times and go on to step S3.

S2. [Update probabilities.] (Each time we do this step it corresponds to tossing
a ball into an urn; A[j] represents the probability that exactly $j$ of the urns
are occupied.) Set $j_1 \leftarrow j_1 + 1$. Then for $j \leftarrow j_1, j_1 - 1, \dots, j_0$ (in this
order), set A[j] $\leftarrow (j/m)$A[j] $+ ((1 + 1/m) - (j/m))$A[j − 1]. If A[j] has
become very small as a result of this calculation, say A[j] $< 10^{-20}$, set
A[j] $\leftarrow 0$; and in such a case, if $j = j_1$ decrease $j_1$ by 1, or if $j = j_0$ increase
$j_0$ by 1.

S3. [Compute the answers.] In this step we make use of an auxiliary table
$(T_1, T_2, \dots, T_{\mathrm{tmax}}) = (.01, .05, .25, .50, .75, .95, .99, 1.00)$ containing the
specified percentage points of interest. Set $p \leftarrow 0$, $t \leftarrow 1$, and $j \leftarrow j_0 - 1$.
Do the following iteration until $t = \mathrm{tmax}$: Increase $j$ by 1, and set
$p \leftarrow p +$ A[j]; then if $p > T_t$, output $n - j - 1$ and $1 - p$ (meaning that with
probability $1 - p$ there are at most $n - j - 1$ collisions) and repeatedly
increase $t$ by 1 until $p \le T_t$. ∎
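Algorithm S can be rendered in Python roughly as follows. The function names are hypothetical, the table index `t` is 0-based, and two small guards (on `t` and on `j`) have been added as an assumption so that floating-point rounding cannot push the loops past the ends of the tables; otherwise the code follows steps S1 to S3.

```python
def occupancy_distribution(m, n, eps=1e-20):
    """Steps S1-S2: A[j] = probability that exactly j urns are occupied (n >= 1)."""
    A = [0.0] * (n + 1)
    A[1] = 1.0                               # after one ball, one urn is occupied
    j0 = j1 = 1
    for _ in range(n - 1):                   # S2 is performed n - 1 times
        j1 += 1
        for j in range(j1, j0 - 1, -1):      # descending, so A[j-1] is still the old value
            A[j] = (j / m) * A[j] + ((1 + 1 / m) - j / m) * A[j - 1]
            if A[j] < eps:                   # drop negligible probabilities
                A[j] = 0.0
                if j == j1:
                    j1 -= 1
                elif j == j0:
                    j0 += 1
    return A, j0, j1

def percentage_points(m, n, T=(.01, .05, .25, .50, .75, .95, .99, 1.00)):
    """Step S3: list of (c, prob), meaning at most c collisions with probability prob."""
    A, j0, j1 = occupancy_distribution(m, n)
    out = []
    p, t, j = 0.0, 0, j0 - 1
    while t < len(T) - 1 and j < j1:
        j += 1
        p += A[j]
        if p > T[t]:
            out.append((n - j - 1, 1.0 - p))
            while t < len(T) - 1 and p > T[t]:
                t += 1
    return out
```

Calling `percentage_points(2**20, 2**14)` takes a few seconds in pure Python and should produce entries of the kind tabulated above; for tiny cases such as $m = 3$, $n = 2$ the distribution can be checked by hand (two urns are occupied with probability $\frac{2}{3}$).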


J. Serial correlation test. We may also compute the following statistic:

$$C = \frac{n(U_0U_1 + U_1U_2 + \dots + U_{n-2}U_{n-1} + U_{n-1}U_0) - (U_0 + U_1 + \dots + U_{n-1})^2}{n(U_0^2 + U_1^2 + \dots + U_{n-1}^2) - (U_0 + U_1 + \dots + U_{n-1})^2}. \tag{23}$$

This is the “serial correlation coefficient,” which is a measure of the amount
$U_{j+1}$ depends on $U_j$.

Correlation coefficients appear frequently in statistics; if we have $n$ quantities
$U_0, U_1, \dots, U_{n-1}$ and $n$ others $V_0, V_1, \dots, V_{n-1}$, the correlation coefficient
between them is defined to be

$$C = \frac{n\sum(U_jV_j) - \bigl(\sum U_j\bigr)\bigl(\sum V_j\bigr)}{\sqrt{\bigl(n\sum U_j^2 - (\sum U_j)^2\bigr)\bigl(n\sum V_j^2 - (\sum V_j)^2\bigr)}}. \tag{24}$$

All summations in this formula are to be taken over the range $0 \le j < n$; Eq.
(23) is the special case $V_j = U_{(j+1) \bmod n}$. (Note: The denominator of (24) is
zero when $U_0 = U_1 = \dots = U_{n-1}$ or $V_0 = V_1 = \dots = V_{n-1}$; we exclude this
case from discussion.)

A correlation coefficient always lies between —1 and -f-1. When it is zero 
or very small, it indicates that the quantities Uj and V) are (relatively speaking) 
independent of each other, but when the correlation coefficient is ^1 it indicates 
total linear dependence; in fact !/■ = ft ^ / 3Uj for all j in such a case, for some 
constants a and / 3 . (See exercise 17.) 
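
As a small illustration of the definition in (24) and its special case (23), the following Python sketch computes both coefficients. The function names are hypothetical, and the sequences are assumed to be Python lists of numbers.

```python
from math import sqrt


def correlation(u, v):
    """Correlation coefficient of two equally long sequences, as in Eq. (24)."""
    n = len(u)
    if n != len(v):
        raise ValueError("sequences must have the same length")
    su, sv = sum(u), sum(v)
    suv = sum(x * y for x, y in zip(u, v))
    suu = sum(x * x for x in u)
    svv = sum(y * y for y in v)
    denom = sqrt((n * suu - su * su) * (n * svv - sv * sv))
    if denom == 0:
        raise ValueError("undefined: one of the sequences is constant")
    return (n * suv - su * sv) / denom


def serial_correlation(u):
    """Eq. (23): correlate u with its cyclic successor V_j = U_{(j+1) mod n}."""
    return correlation(u, u[1:] + u[:1])
```

An exact linear relation such as `correlation([1, 2, 3, 4], [3, 5, 7, 9])` gives a value of 1 up to rounding error.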

Therefore it is desirable to have C in Eq. (23) close to zero. In actual
fact, since $U_0U_1$ is not completely independent of $U_1U_2$, the serial correlation
coefficient is not expected to be exactly zero. (See exercise 18.) A “good” value
of C will be between $\mu_n - 2\sigma_n$ and $\mu_n + 2\sigma_n$, where

$$\mu_n = \frac{-1}{n-1}, \qquad \sigma_n = \frac{1}{n-1}\sqrt{\frac{n(n-3)}{n+1}}, \qquad n > 2. \tag{25}$$


We expect C to be between these limits about 95 percent of the time. 
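
The acceptance band can be sketched as follows, reading (25) as $\mu_n = -1/(n-1)$ and $\sigma_n = \frac{1}{n-1}\sqrt{n(n-3)/(n+1)}$. That reading, and the function names, are assumptions made for this illustration.

```python
from math import sqrt


def serial_correlation(u):
    """Serial correlation coefficient of Eq. (23) for a list u."""
    n = len(u)
    s = sum(u)
    cross = sum(u[j] * u[(j + 1) % n] for j in range(n))
    squares = sum(x * x for x in u)
    return (n * cross - s * s) / (n * squares - s * s)


def acceptance_limits(n):
    """Return (mu_n - 2*sigma_n, mu_n + 2*sigma_n) for n > 2."""
    if n <= 2:
        raise ValueError("n must exceed 2")
    mu = -1.0 / (n - 1)
    sigma = sqrt(n * (n - 3) / (n + 1)) / (n - 1)
    return mu - 2 * sigma, mu + 2 * sigma


def looks_uncorrelated(u):
    """True if C lies inside the two-sigma band for len(u) observations."""
    lo, hi = acceptance_limits(len(u))
    return lo <= serial_correlation(u) <= hi
```

A slowly rising ramp such as `[j / 1000 for j in range(1000)]` has C close to 1 and falls far outside the band.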

Equations (25) are only conjectured at this time, since the exact distribution
of C is not known when the U’s are uniformly distributed. For the theory when
the U’s have the normal distribution, see the paper by Wilfrid J. Dixon, Annals
Math. Stat. 15 (1944), 119-144. Empirical evidence suggests that we may safely
use the formulas for the mean and standard deviation that have been derived
from the assumption of the normal distribution, without much error; these are
the values that have been listed in (25). It is known that $\lim_{n\to\infty} \sqrt{n}\,\sigma_n = 1$; cf.
the article by Anderson and Walker, Annals Math. Stat. 35 (1964), 1296-1303,
where more general results about serial correlations of dependent sequences are
derived.

Instead of simply computing the correlation coefficient between the observations
$(U_0, U_1, \ldots, U_{n-1})$ and their immediate successors $(U_1, \ldots, U_{n-1}, U_0)$,
we can also compute it between $(U_0, U_1, \ldots, U_{n-1})$ and any cyclically shifted
sequence $(U_q, \ldots, U_{n-1}, U_0, \ldots, U_{q-1})$; the cyclic correlations should be small
for $0 < q < n$. A straightforward computation of Eq. (24) for all q would
require about $n^2$ multiplications, but it is actually possible to compute all the
correlations in only $O(n \log n)$ steps by using “fast Fourier transforms.” (See
Section 4.6.4; cf. also L. P. Schmid, CACM 8 (1965), 115.)
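
The idea can be sketched in Python. This is not the method of Section 4.6.4 or of the cited paper; it is a plain recursive radix-2 transform, which assumes that n is a power of two, and all function names are hypothetical. It relies on the standard fact that the lagged products $\sum_j U_j U_{(j+q) \bmod n}$ are the inverse transform of the squared magnitudes of the transform of U.

```python
import cmath


def _fft(a, sign):
    """Recursive radix-2 transform; sign=-1 is forward, sign=+1 is the
    unnormalized inverse. len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return [complex(a[0])]
    even = _fft(a[0::2], sign)
    odd = _fft(a[1::2], sign)
    half = n // 2
    out = [0j] * n
    for k in range(half):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + half] = even[k] - w * odd[k]
    return out


def cyclic_correlations(u):
    """Correlation of u with each cyclic shift q = 0..n-1, in O(n log n)."""
    n = len(u)
    if n < 2 or n & (n - 1):
        raise ValueError("this sketch needs n to be a power of two")
    power = [abs(z) ** 2 for z in _fft(u, -1)]
    # lagged[q] = sum over j of u[j] * u[(j+q) mod n]
    lagged = [z.real / n for z in _fft(power, +1)]
    s = sum(u)
    denom = n * lagged[0] - s * s
    return [(n * lagged[q] - s * s) / denom for q in range(n)]


def cyclic_correlations_direct(u):
    """Same quantities by the straightforward n^2 method, for checking."""
    n = len(u)
    s = sum(u)
    denom = n * sum(x * x for x in u) - s * s
    return [(n * sum(u[j] * u[(j + q) % n] for j in range(n)) - s * s) / denom
            for q in range(n)]
```

The entry for q = 0 is always 1, and the entry for q = 1 is the serial correlation coefficient (23).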

K. Tests on subsequences. It frequently happens that the external program 
using our random sequence will call for numbers in batches. For example, if the 
program works with three random variables X, Y, and Z, it may consistently 
invoke the generation of three random numbers at a time. In such applications it 
is important that the subsequences consisting of every third term of the original 
sequence be random. If the program requires q numbers at a time, the sequences 


$$U_0, U_q, U_{2q}, \ldots; \quad U_1, U_{q+1}, U_{2q+1}, \ldots; \quad \ldots; \quad U_{q-1}, U_{2q-1}, U_{3q-1}, \ldots$$

can each be put through the tests described above for the original sequence $U_0$,
$U_1$, $U_2$, ... .

Experience with linear congruential sequences has shown that these derived 
sequences rarely if ever behave less randomly than the original sequence, unless q 
has a large factor in common with the period length. On a binary computer with 
m equal to the word size, for example, a test of the subsequences for q = 8 will 
tend to give the poorest randomness for all q < 16; and on a decimal computer, 
q = 10 yields the subsequences most likely to be unsatisfactory. (This can be 
explained somewhat on the grounds of potency, since such values of q will tend 
to lower the potency.) 
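
Extracting the q derived sequences is a one-line operation in most languages. The Python sketch below (with a hypothetical function name) shows the splitting step only; each resulting list would then be passed to whichever empirical test is being applied.

```python
def derived_subsequences(u, q):
    """Split u into the q subsequences U_r, U_{r+q}, U_{r+2q}, ...
    for r = 0, 1, ..., q-1."""
    if q < 1:
        raise ValueError("q must be a positive integer")
    return [u[r::q] for r in range(q)]
```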
L. Historical remarks and further discussion. Statistical tests arose naturally
in the course of scientists’ efforts to “prove” or “disprove” hypotheses about 
various observed data. The best known early papers dealing with the testing of 
artificially generated numbers for randomness are two articles by M. G. Kendall 
and B. Babington-Smith in the Journal of the Royal Statistical Society 101 
(1938), 147-166, and in the supplement to that journal, 6 (1939), 51-61. These 
papers were concerned with the testing of random digits between 0 and 9, rather 
than random real numbers; for this purpose, the authors discussed the frequency 
test, serial test, gap test, and poker test, although they misapplied the serial 
test. Kendall and Babington-Smith also used a variant of the coupon collector’s 
test; the method described in this section was introduced by R. E. Greenwood 
in Math. Comp. 9 (1955), 1-4. 

The run test has a rather interesting history. Originally, tests were made 
on runs up and down at once: a run up would be followed by a run down, then 
another run up, and so on. Note that the run test and the permutation test
do not depend on the uniform distribution of the U’s; they depend only on the
fact that $U_i = U_j$ occurs with probability zero when $i \ne j$; therefore these tests
can be applied to many types of random sequences. The run test in primitive
form was originated by J. Bienaymé [Comptes Rendus 81 (Paris: Acad. Sciences,
1875), 417-423]. Some sixty years later, W. O. Kermack and A. G. McKendrick
published two extensive papers on the subject [Proc. Royal Society Edinburgh
57 (1937), 228-240, 332-376]; as an example they stated that Edinburgh rainfall
between the years 1785 and 1930 was “entirely random in character” with respect 
to the run test (although they examined only the mean and standard deviation 
of the run lengths). Several other people began using the test, but it was not 
until 1944 that the use of the chi-square method in connection with this test was 
shown to be incorrect. The paper by H. Levene and J. Wolfowitz in Annals Math. 
Stat. 15 (1944), 58-69, introduced the correct run test (for runs up and down, 
alternately) and discussed the fallacies in earlier misuses of that test. Separate 
tests for runs up and runs down, as proposed in the text above, are more suited 
to computer application, so we have not given the more complex formulas for 
the alternate-up-and-down case. See the survey paper by D. E. Barton and C. L. 
Mallows, Annals Math. Stat. 36 (1965), 236-260. 

Of all the tests we have discussed, the frequency test and the serial correlation
test seem to be the weakest, in the sense that nearly all random number
generators pass these tests. Theoretical grounds for the weakness of these tests
are discussed briefly in Section 3.5 (cf. exercise 3.5-26). The run test, on the
other hand, is a rather strong test: the results of exercises 3.3.3-23 and 24 suggest
that linear congruential generators tend to have runs somewhat longer than
normal if the multiplier is not large enough, so the run test of exercise 14 is
definitely to be recommended.

The collision test is also highly recommended, since it has been especially
designed to detect the deficiencies of many poor generators that have unfortunately
become widespread. This test, which is based on ideas of H. Delgas
Christiansen [Inst. Math. Stat. and Oper. Res., Tech. Univ. Denmark (Oct. 1975),
unpublished], is unlike the others in that it was not developed before the advent
of computers; it is specifically intended for computer use.

The reader probably wonders, “Why are there so many tests?” It has been 
said that more computer time is spent testing random numbers than using them 
in applications! This is untrue, although it is possible to go overboard in testing. 

The need for making several tests has been amply documented. It has been 
recorded, for example, that some numbers generated by a variant of the middle- 
square method have passed the frequency test, gap test, and poker test, yet 
flunked the serial test. Linear congruential sequences with small multipliers have 
been known to pass many tests, yet fail on the run test because there are too 
few runs of length one. The maximum-of-t test has also been used to ferret out
some bad generators that otherwise seemed to perform respectably. 

Perhaps the main reason for doing extensive testing on random number 
generators is that people misusing Mr. X’s random number generator will hardly 
ever admit that their programs are at fault: they will blame the generator, until 
Mr. X can prove to them that his numbers are sufficiently random. On the other 
hand, if the source of random numbers is only for Mr. X’s personal use, he might 
decide not to bother to test them, since the techniques recommended in this 
chapter have a high probability of being satisfactory. 

EXERCISES 

1. [10] Why should the serial test described in part B be applied to $(Y_0, Y_1)$, $(Y_2, Y_3)$,
..., $(Y_{2n-2}, Y_{2n-1})$ instead of to $(Y_0, Y_1)$, $(Y_1, Y_2)$, ..., $(Y_{n-1}, Y_n)$?

2. [10] State an appropriate way to generalize the serial test to triples, quadruples, 
etc., instead of pairs. 

► 3. [M20] How many U’s need to be examined in the gap test (Algorithm G) before
n gaps have been found, on the average, assuming that the sequence is random? What 
is the standard deviation of this quantity? 

4. [12] Prove that the probabilities in (4) are correct for the gap test. 

5. [M23] The “classical” gap test used by Kendall and Babington-Smith considers
the numbers $U_0$, $U_1$, ..., $U_{N-1}$ to be a cyclic sequence with $U_{N+j}$ identified with $U_j$.
Here N is a fixed number of U’s that are to be subjected to the test. If n of the numbers
$U_0$, ..., $U_{N-1}$ fall into the range $\alpha \le U_j < \beta$, there are n gaps in the cyclic sequence.
Let $Z_r$ be the number of gaps of length r, for $0 \le r < t$, and let $Z_t$ be the number
of gaps of length $\ge t$; show that the quantity $V = \sum_{0 \le r \le t} (Z_r - np_r)^2/np_r$ should
have the chi-square distribution with t degrees of freedom, in the limit as N goes to
infinity, where $p_r$ is given in Eq. (4).

6. [40] (H. Geiringer.) A frequency count of the first 2000 decimal digits in the
representation of e = 2.71828... gave a $\chi^2$ value of 1.06, indicating that the actual
frequencies of the digits 0, 1, ..., 9 are much too close to their expected values to be
considered randomly distributed. (In fact, $\chi^2 \ge 1.15$ with probability 99.9 percent.)
The same test applied to the first 10,000 digits of e gives the reasonable value $\chi^2 = 8.61$;
but the fact that the first 2000 digits are so evenly distributed is still surprising.
Does the same phenomenon occur in the representation of e to other bases? [See AMM
72 (1965), 483-500.]
7. [08] Apply the coupon collector’s test procedure (Algorithm C) with d = 3
and n = 7, to the following sequence: 1101221022120202001212201010201121. What 
lengths do the seven subsequences have? 

► 8. [M22] How many U’s need to be examined, on the average, in the coupon collector’s
test (Algorithm C) before n complete sets have been found, assuming that the
sequence is random? What is the standard deviation? [Hint: See Eq. 1.2.9-28.]

9. [M21] Generalize the coupon collector’s test so that the search stops as soon as
w distinct values have been found, where w is a fixed positive integer less than or equal
to d. What probabilities should be used in place of (6)?

10. [M23] Solve exercise 8 for the more general coupon collector’s test described in
exercise 9.

11. [00] The “runs up” in a particular permutation are displayed in (9); what are the 
“runs down” in that permutation? 

12. [20] Let $U_0$, $U_1$, ..., $U_{n-1}$ be n distinct numbers. Write an algorithm that
determines the lengths of all ascending runs in the sequence. When your algorithm
terminates, COUNT[r] should be the number of runs of length r, for $1 \le r \le 5$, and
COUNT[6] should be the number of runs of length 6 or more.

13. [M23] Show that (16) is the number of permutations of $p + q + 1$ distinct elements
having the pattern (15).

► 14. [M15] If we “throw away” the element that immediately follows a run, so that
when $X_j$ is greater than $X_{j+1}$ we start the next run with $X_{j+2}$, the run lengths are
independent, and a simple chi-square test may be used (instead of the horribly complicated
method derived in the text). What are the appropriate run-length probabilities
for this simple run test?

15. [M10] In the maximum-of-t test, why are $V_0^t$, $V_1^t$, ..., $V_{n-1}^t$ supposed to be
uniformly distributed between zero and one?

► 16. [15] (a) Mr. J. H. Quick (a student) wanted to perform the maximum-of-t test
for various values of t. Letting $Z_{jt} = \max(U_j, U_{j+1}, \ldots, U_{j+t-1})$, he found a clever
way to go from the sequence $Z_{0(t-1)}$, $Z_{1(t-1)}$, ..., to the sequence $Z_{0t}$, $Z_{1t}$, ..., using
very little time and space. What was his bright idea?

(b) He decided to modify the maximum-of-t method so that the jth observation
would be $\max(U_j, \ldots, U_{j+t-1})$; in other words, he took $V_j = Z_{jt}$ instead of $V_j = Z_{(tj)t}$
as the text says. He reasoned that all of the Z’s should have the same distribution,
so the test is even stronger if each $Z_{jt}$, $0 \le j < n$, is used instead of just every tth
one. But when he tried a chi-square equidistribution test on the values of $V_j$, he got
extremely high values of the statistic V, which got even higher as t increased. Why
did this happen?

17. [M25] (a) Given any numbers $U_0$, ..., $U_{n-1}$, $V_0$, ..., $V_{n-1}$, let

$$\bar u = \frac{1}{n} \sum_{0 \le k < n} U_k, \qquad \bar v = \frac{1}{n} \sum_{0 \le k < n} V_k.$$

Let $U'_k = U_k - \bar u$, $V'_k = V_k - \bar v$. Show that the correlation coefficient C given in Eq.
(24) is equal to

$$\sum_{0 \le k < n} U'_k V'_k \Bigg/ \sqrt{\sum_{0 \le k < n} {U'_k}^2}\ \sqrt{\sum_{0 \le k < n} {V'_k}^2}.$$

(b) Let C = N/D, where N and D denote the numerator and denominator of
the expression in part (a). Show that $N^2 \le D^2$, hence $-1 \le C \le 1$; and obtain a
formula for the difference $D^2 - N^2$. [Hint: See exercise 1.2.3-30.]

(c) If $C = \pm 1$, show that $\alpha X_k + \beta Y_k = \tau$, $0 \le k < n$, for some constants $\alpha$, $\beta$,
and $\tau$, not all zero.

18. [M20] (a) Show that if n = 2, the serial correlation coefficient (23) is always equal
to $-1$ (unless the denominator is zero). (b) Similarly, show that when n = 3, the serial
correlation coefficient always equals $-\frac12$. (c) Show that the denominator in (23) is zero
if and only if $U_0 = U_1 = \cdots = U_{n-1}$.

19. [M40] What are the mean and standard deviation of the serial correlation coefficient
(23) when n = 4 and the U’s are independent and uniformly distributed between
zero and one?

20. [M47] Find the distribution of the serial correlation coefficient (23), for general n,
assuming that the $U_j$ are independent random variables uniformly distributed between
zero and one.

21. [19] What value of f is computed by Algorithm P if it is presented with the
permutation (1, 2, 9, 8, 5, 3, 6, 7, 0, 4)?

22. [18] For what permutation of {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} will Algorithm P produce
the value f = 1024?


*3.3.3. Theoretical Tests 

Although it is always possible to test a random number generator using the 
methods in the previous section, it is far better to have “a priori tests,” i.e., 
theoretical results that tell us in advance how well those tests will come out. 
Such theoretical results give us much more understanding about the generation 
methods than empirical, “trial-and-error” results do. In this section we shall 
study the linear congruential sequences in more detail; if we know what the 
results of certain tests will be before we actually generate the numbers, we have 
a better chance of choosing a, m, and c properly. 

The development of this kind of theory is quite difficult, although some 
progress has been made. The results obtained so far are generally for statistical 
tests made over the entire period. Not all statistical tests make sense when 
they are applied over a full period—for example, the equidistribution test will 
give results that are too perfect—but the serial test, gap test, permutation test, 
maximum test, etc. can be fruitfully analyzed in this way. Such studies will 
detect global nonrandomness of a sequence, i.e., improper behavior in very large 
samples. 

The theory we shall discuss is quite illuminating, but it does not eliminate the 
need for testing local nonrandomness by the methods of Section 3.3.2. Indeed, it 
appears to be extremely hard to prove anything useful about short subsequences. 
Only a few theoretical results are known about the behavior of linear congruential 
sequences over less than a full period; these will be discussed at the end of Section 
3.3.4. (See also exercise 18.) 
Let us begin with a proof of a simple a priori law, for the least complicated
case of the permutation test. The gist of our first theorem is that we have
$X_{n+1} < X_n$ about half the time, provided that the sequence has high potency.

Theorem P. Let a, c, and m generate a linear congruential sequence with
maximum period; let b = a − 1 and let d be the greatest common divisor of m
and b. The probability that $X_{n+1} < X_n$ is equal to $\frac12 + r$, where

$$r = \bigl(2(c \bmod d) - d\bigr)/2m; \tag{1}$$

hence $|r| < d/2m$.

Proof. The proof of this theorem involves some techniques that are of interest
in themselves. First we define

$$s(x) = (ax + c) \bmod m. \tag{2}$$

Thus, $X_{n+1} = s(X_n)$, and the theorem reduces to counting the number of
integers x such that $0 \le x < m$ and $s(x) < x$ (since each such integer occurs
somewhere in the period). We want to show that this number is

$$\tfrac12\bigl(m + 2(c \bmod d) - d\bigr). \tag{3}$$

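The function $\lceil (x - s(x))/m \rceil$ is equal to 1 when $x > s(x)$, and it is 0
otherwise; hence the count we wish to obtain can be written simply as

$$\sum_{0 \le x < m} \left\lceil \frac{x - s(x)}{m} \right\rceil = \sum_{0 \le x < m} \left( \left\lfloor \frac{ax + c}{m} \right\rfloor - \left\lfloor \frac{bx + c}{m} \right\rfloor \right). \tag{4}$$

(Recall that $\lceil -y \rceil = -\lfloor y \rfloor$ and b = a − 1.) Such sums can be evaluated by the
method of exercise 1.2.4-37, where we have proved that

$$\sum_{0 \le j < k} \left\lfloor \frac{hj + c}{k} \right\rfloor = \frac{(h-1)(k-1)}{2} + \frac{g-1}{2} + g\lfloor c/g \rfloor, \qquad g = \gcd(h, k), \tag{5}$$

whenever h and k are integers and k > 0. Since a is relatively prime to m, this
formula yields

$$\frac{(a-1)(m-1)}{2} + c - \frac{(b-1)(m-1)}{2} - \frac{d-1}{2} - d\lfloor c/d \rfloor = \frac{m - d}{2} + (c \bmod d),$$

and (3) follows immediately. |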
Fig. 7. The sawtooth function $((z))$.


The proof of Theorem P indicates that a priori tests can indeed be carried
out, provided that we are able to deal satisfactorily with sums involving the $\lfloor\ \rfloor$
and $\lceil\ \rceil$ functions. In many cases the most powerful technique for dealing with
floor and ceiling functions is to replace them by two somewhat more symmetrical
ones:

$$\delta(z) = \lfloor z \rfloor + 1 - \lceil z \rceil = \begin{cases} 1, & \text{if $z$ is an integer;} \\ 0, & \text{if $z$ is not an integer;} \end{cases} \tag{6}$$

$$((z)) = z - \lfloor z \rfloor - \tfrac12 + \tfrac12\delta(z) = z - \lceil z \rceil + \tfrac12 - \tfrac12\delta(z). \tag{7}$$

The latter function is a “sawtooth” function familiar in the study of Fourier
series; its graph is shown in Fig. 7. The reason for choosing to work with $((z))$
rather than $\lfloor z \rfloor$ or $\lceil z \rceil$ is that $((z))$ possesses several very useful properties:

$$((-z)) = -((z)); \tag{8}$$

$$((z + n)) = ((z)), \qquad \text{integer } n; \tag{9}$$

$$((nz)) = ((z)) + \left(\!\!\left(z + \tfrac{1}{n}\right)\!\!\right) + \cdots + \left(\!\!\left(z + \tfrac{n-1}{n}\right)\!\!\right), \qquad \text{integer } n \ge 1. \tag{10}$$

(See exercise 2.)
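
The following Python sketch defines the two functions with exact rational arithmetic and states properties (8), (9), and (10) as checkable predicates. It assumes the usual definitions $\delta(z) = \lfloor z \rfloor + 1 - \lceil z \rceil$ and $((z)) = z - \lfloor z \rfloor - \frac12 + \frac12\delta(z)$; the function names are hypothetical.

```python
from fractions import Fraction
from math import floor


def delta(z):
    """1 if z is an integer, 0 otherwise."""
    return 1 if z == floor(z) else 0


def sawtooth(z):
    """The sawtooth function ((z)), computed exactly."""
    z = Fraction(z)
    return z - floor(z) - Fraction(1, 2) + Fraction(delta(z), 2)


def check_property_8(z):
    return sawtooth(-z) == -sawtooth(z)


def check_property_9(z, n):
    return sawtooth(z + n) == sawtooth(z)


def check_property_10(z, n):
    """((nz)) equals the sum of ((z + i/n)) for i = 0..n-1, integer n >= 1."""
    return sawtooth(n * z) == sum(sawtooth(z + Fraction(i, n)) for i in range(n))
```

Working with `Fraction` rather than floating point keeps the integer case, where the function jumps, exact.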

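In order to get some practice working with these functions, let us prove
Theorem P again, this time without relying on exercise 1.2.4-37. With the help
of Eqs. (7), (8), (9), we can show that

$$\begin{aligned}
\left\lceil \frac{x - s(x)}{m} \right\rceil &= \frac{x - s(x)}{m} - \left(\!\!\left(\frac{x - s(x)}{m}\right)\!\!\right) + \frac12 \\
&= \frac{x - s(x)}{m} - \left(\!\!\left(\frac{x - (ax + c)}{m}\right)\!\!\right) + \frac12 \\
&= \frac{x - s(x)}{m} + \left(\!\!\left(\frac{bx + c}{m}\right)\!\!\right) + \frac12
\end{aligned} \tag{11}$$

since $(x - s(x))/m$ is never an integer. Now

$$\sum_{0 \le x < m} \frac{x - s(x)}{m} = 0$$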
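since both x and s(x) take on each value of {0, 1, ..., m − 1} exactly once; hence
(11) yields

$$\sum_{0 \le x < m} \left\lceil \frac{x - s(x)}{m} \right\rceil = \sum_{0 \le x < m} \left(\!\!\left(\frac{bx + c}{m}\right)\!\!\right) + \frac{m}{2}. \tag{12}$$

Let $b = b_0 d$, $m = m_0 d$, where $b_0$ and $m_0$ are relatively prime. We know that
$(b_0 x) \bmod m_0$ takes on the values $\{0, 1, \ldots, m_0 - 1\}$ in some order as x varies
from 0 to $m_0 - 1$. By (9) and (10) and the fact that

$$\left(\!\!\left(\frac{b(x + m_0) + c}{m}\right)\!\!\right) = \left(\!\!\left(\frac{bx + c}{m}\right)\!\!\right)$$

we have

$$\sum_{0 \le x < m} \left(\!\!\left(\frac{bx + c}{m}\right)\!\!\right) = d \sum_{0 \le x < m_0} \left(\!\!\left(\frac{bx + c}{m}\right)\!\!\right) = d \sum_{0 \le x < m_0} \left(\!\!\left(\frac{c}{m} + \frac{b_0 x}{m_0}\right)\!\!\right) = d \left(\!\!\left(\frac{c}{d}\right)\!\!\right). \tag{13}$$

Theorem P follows immediately from (12) and (13).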

One consequence of Theorem P is that practically any choice of a and c will
give a reasonable probability that $X_{n+1} < X_n$, at least over the entire period,
except those that have large d. A large value of d corresponds to low potency,
and we already know that generators of low potency are undesirable.
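
Theorem P is easy to confirm by brute force for small moduli. The Python sketch below counts the values x with s(x) < x over a full period and compares the result with $\frac12 + r$, where $r = (2(c \bmod d) - d)/2m$. The function names are hypothetical, and the period check simply iterates the generator, so it is meant for small m only.

```python
from fractions import Fraction
from math import gcd


def has_full_period(a, c, m):
    """True if x -> (a*x + c) mod m, started at 0, returns to 0 after exactly m steps."""
    x, steps = 0, 0
    while True:
        x = (a * x + c) % m
        steps += 1
        if x == 0 or steps > m:
            break
    return steps == m


def descent_probability(a, c, m):
    """Exact fraction of x in [0, m) with s(x) < x, by enumeration."""
    count = sum(1 for x in range(m) if (a * x + c) % m < x)
    return Fraction(count, m)


def theorem_p(a, c, m):
    """The value 1/2 + r predicted by Theorem P."""
    d = gcd(m, a - 1)
    return Fraction(1, 2) + Fraction(2 * (c % d) - d, 2 * m)
```

For m = 16, a = 5, c = 3 we have d = 4, and both functions give 9/16.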

The next theorem gives us a more stringent condition for the choice of a
and c; we will consider the serial correlation test applied over the entire period.
The quantity C defined in Section 3.3.2, Eq. (23), is

$$C = \left( m \sum_{0 \le x < m} x\,s(x) - \Bigl(\sum_{0 \le x < m} x\Bigr)^{2} \right) \Bigg/ \left( m \sum_{0 \le x < m} x^2 - \Bigl(\sum_{0 \le x < m} x\Bigr)^{2} \right). \tag{14}$$

Let x' be the element such that s(x') = 0. We have

$$s(x) = m \left(\!\!\left(\frac{ax + c}{m}\right)\!\!\right) + \frac{m}{2}, \qquad \text{if } x \ne x'. \tag{15}$$

The formulas we are about to derive can be expressed most easily in terms of
the function

$$\sigma(h, k, c) = 12 \sum_{0 \le j < k} \left(\!\!\left(\frac{j}{k}\right)\!\!\right) \left(\!\!\left(\frac{hj + c}{k}\right)\!\!\right), \tag{16}$$

an important function that arises in several mathematical problems. It is called
a generalized Dedekind sum, since Richard Dedekind introduced the function
$\sigma(h, k, 0)$ in 1876 when commenting on one of Riemann’s incomplete manuscripts.
[See B. Riemann’s Gesammelte Math. Werke, 2nd ed. (1892), 466-478.]
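
A generalized Dedekind sum can be evaluated straight from its definition for small k. The sketch below assumes the definition $\sigma(h,k,c) = 12 \sum_{0 \le j < k} ((j/k))\,(((hj + c)/k))$, with $((z))$ the sawtooth function; the function names are hypothetical, and exact rationals are used so that results can be compared for equality.

```python
from fractions import Fraction
from math import floor


def sawtooth(z):
    """((z)) = z - floor(z) - 1/2, except that ((z)) = 0 when z is an integer."""
    z = Fraction(z)
    if z == floor(z):
        return Fraction(0)
    return z - floor(z) - Fraction(1, 2)


def dedekind_sum(h, k, c):
    """sigma(h, k, c) by direct summation: k terms, exact arithmetic."""
    return 12 * sum(sawtooth(Fraction(j, k)) * sawtooth(Fraction(h * j + c, k))
                    for j in range(k))
```

This takes time proportional to k, which is why a faster method is needed when k is as large as a typical modulus.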
Using the well-known formulas

$$\sum_{0 \le x < m} x = \frac{m(m-1)}{2} \qquad \text{and} \qquad \sum_{0 \le x < m} x^2 = \frac{m(m-1)(2m-1)}{6}$$

it is a straightforward matter to transform Eq. (14) into

$$C = \frac{m\,\sigma(a, m, c) - 3 + 6(m - x' - c)}{m^2 - 1}. \tag{17}$$

(See exercise 5.) Since m is usually very large, we may discard terms of order
1/m, and we have the approximation

$$C \approx \sigma(a, m, c)/m, \tag{18}$$

with an error of less than 6/m in absolute value.
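
For a small modulus the closed form can be compared with the raw definition of C. The sketch below reads (17) as $C = (m\,\sigma(a,m,c) - 3 + 6(m - x' - c))/(m^2 - 1)$ and assumes $0 < c < m$ with a relatively prime to m; the function names are hypothetical.

```python
from fractions import Fraction
from math import floor


def sawtooth(z):
    z = Fraction(z)
    return Fraction(0) if z == floor(z) else z - floor(z) - Fraction(1, 2)


def dedekind_sum(h, k, c):
    """sigma(h, k, c) = 12 * sum of ((j/k)) * (((h*j + c)/k)), 0 <= j < k."""
    return 12 * sum(sawtooth(Fraction(j, k)) * sawtooth(Fraction(h * j + c, k))
                    for j in range(k))


def exact_serial_correlation(a, c, m):
    """C over the full period, from sums over all x in [0, m)."""
    sx = sum(range(m))
    sxx = sum(x * x for x in range(m))
    sxs = sum(x * ((a * x + c) % m) for x in range(m))
    return Fraction(m * sxs - sx * sx, m * sxx - sx * sx)


def formula_17(a, c, m):
    """The same quantity via the Dedekind sum and x', where s(x') = 0."""
    x_prime = next(x for x in range(m) if (a * x + c) % m == 0)
    return (m * dedekind_sum(a, m, c) - 3 + 6 * (m - x_prime - c)) / Fraction(m * m - 1)
```

With m = 16, a = 5, c = 3 both functions return 23/85, while the approximation $\sigma/m$ gives 3/16; the two differ by less than 6/m, as the text states.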

The serial correlation test now reduces to determining the value of the
Dedekind sum $\sigma(a, m, c)$. Evaluating $\sigma(a, m, c)$ directly from its definition (16) is
hardly any easier than evaluating the correlation coefficient itself directly, but
fortunately there are simple methods available for computing Dedekind sums
quite rapidly.

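Lemma B (“Reciprocity law” for Dedekind sums). Let h, k, c be integers. If
$0 \le c < k$, $0 < h \le k$, and if h is relatively prime to k, then

$$\sigma(h, k, c) + \sigma(k, h, c) = \frac{h}{k} + \frac{k}{h} + \frac{1}{hk} + \frac{6c^2}{hk} - 6\left\lfloor \frac{c}{h} \right\rfloor - 3e(h, c), \tag{19}$$

where

$$e(h, c) = \begin{cases} 1, & \text{if } c = 0 \text{ or } c \bmod h \ne 0; \\ 0, & \text{if } c > 0 \text{ and } c \bmod h = 0. \end{cases} \tag{20}$$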

Proof. We leave it to the reader to prove that, under these hypotheses,

$$\sigma(h, k, c) + \sigma(k, h, c) = \sigma(h, k, 0) + \sigma(k, h, 0) + \frac{6c^2}{hk} - 6\left\lfloor \frac{c}{h} \right\rfloor - 3e(h, c) + 3. \tag{21}$$

(See exercise 6.) The lemma now must be proved only in the case c = 0.

The proof we will give, based on complex roots of unity, is essentially
due to L. Carlitz. There is actually a simpler proof that uses only elementary
manipulations of sums (see exercise 7), but the following method reveals more
of the mathematical tools that are available for problems of this kind and it is
therefore much more instructive.

Let f(x) and g(x) be polynomials defined as follows:

$$\begin{aligned}
f(x) &= 1 + x + \cdots + x^{k-1} = (x^k - 1)/(x - 1), \\
g(x) &= x + 2x^2 + \cdots + (k-1)x^{k-1} = x f'(x) \\
&= kx^k/(x - 1) - x(x^k - 1)/(x - 1)^2.
\end{aligned} \tag{22}$$
If $\omega$ is the complex kth root of unity $e^{2\pi i/k}$, we have by Eq. 1.2.9-13

$$\frac{1}{k} \sum_{0 \le j < k} \omega^{-jr} g(\omega^j x) = r x^r, \qquad \text{if } 0 \le r < k. \tag{23}$$

Set x = 1; then $g(\omega^j x) = k/(\omega^j - 1)$ if $j \ne 0$, otherwise it equals $k(k - 1)/2$,
therefore

$$r \bmod k = \sum_{0 < j < k} \frac{\omega^{-jr}}{\omega^j - 1} + \frac12 (k - 1), \qquad \text{if $r$ is an integer.}$$

(Eq. (23) shows that the right-hand side equals r when $0 \le r < k$, and it is
unchanged when multiples of k are added to r.) Hence

$$\left(\!\!\left(\frac{r}{k}\right)\!\!\right) = \frac{1}{k} \sum_{0 < j < k} \frac{\omega^{-jr}}{\omega^j - 1} - \frac{1}{2k} + \frac12\,\delta\!\left(\frac{r}{k}\right). \tag{24}$$
This important formula, which holds whenever r is an integer, allows us to reduce
many calculations involving $((r/k))$ to sums involving kth roots of unity, and it
brings a whole new range of techniques into the picture. In particular, we get
the following formula:

$$\sigma(h, k, 0) + \frac{3(k - 1)}{k^2} = \frac{12}{k^2} \sum_{0 < r < k}\ \sum_{0 < i < k}\ \sum_{0 < j < k} \frac{\omega^{-ir}}{\omega^i - 1}\,\frac{\omega^{-jhr}}{\omega^j - 1}. \tag{25}$$

The right-hand side of this formula may be simplified by carrying out the sum
on r; we have $\sum_{0 \le r < k} \omega^{rs} = f(\omega^s) = 0$ if $s \bmod k \ne 0$. Equation (25) now
reduces to

$$\sigma(h, k, 0) + \frac{3(k - 1)}{k} = \frac{12}{k} \sum_{0 < j < k} \frac{1}{(\omega^{-jh} - 1)(\omega^j - 1)}. \tag{26}$$

A similar formula is obtained for $\sigma(k, h, 0)$, with $\zeta = e^{2\pi i/h}$ replacing $\omega$.

It is not obvious what we can do with the sum in (26), but there is an elegant
way to proceed, based on the fact that each term of the sum is a function of $\omega^j$,
where $0 < j < k$; hence the sum is essentially taken over the kth roots of unity
other than 1. Whenever $x_1$, $x_2$, ..., $x_n$ are distinct complex numbers, we have
the identity

$$\sum_{1 \le j \le n} \frac{1}{(x_j - x_1) \cdots (x_j - x_{j-1})(x - x_j)(x_j - x_{j+1}) \cdots (x_j - x_n)} = \frac{1}{(x - x_1) \cdots (x - x_n)}, \tag{27}$$

which follows from the usual method of expanding the right-hand side into partial
fractions. Moreover, if $q(x) = (x - y_1)(x - y_2) \cdots (x - y_m)$, we have

$$q'(y_j) = (y_j - y_1) \cdots (y_j - y_{j-1})(y_j - y_{j+1}) \cdots (y_j - y_m); \tag{28}$$

this identity may often be used to simplify expressions like those in the left-hand
side of (27). When h and k are relatively prime, the numbers $\omega$, $\omega^2$, ..., $\omega^{k-1}$,
$\zeta$, $\zeta^2$, ..., $\zeta^{h-1}$ are all distinct; we can therefore consider formula (27) in the
special case of the polynomial $(x - \omega) \cdots (x - \omega^{k-1})(x - \zeta) \cdots (x - \zeta^{h-1}) =
(x^k - 1)(x^h - 1)/(x - 1)^2$, obtaining the following identity in x:

$$\frac{1}{h} \sum_{0 < j < h} \frac{\zeta^j(\zeta^j - 1)^2}{(\zeta^{jk} - 1)(x - \zeta^j)} + \frac{1}{k} \sum_{0 < j < k} \frac{\omega^j(\omega^j - 1)^2}{(\omega^{jh} - 1)(x - \omega^j)} = \frac{(x - 1)^2}{(x^h - 1)(x^k - 1)}. \tag{29}$$

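This identity has many interesting consequences, and it leads to numerous reciprocity
formulas for sums of the type given in Eq. (26). For example, if we
differentiate (29) twice with respect to x and let $x \to 1$, we find that

$$\frac{2}{h} \sum_{0 < j < h} \frac{\zeta^j(\zeta^j - 1)^2}{(\zeta^{jk} - 1)(1 - \zeta^j)^3} + \frac{2}{k} \sum_{0 < j < k} \frac{\omega^j(\omega^j - 1)^2}{(\omega^{jh} - 1)(1 - \omega^j)^3} = \frac16\left(\frac{h}{k} + \frac{k}{h} + \frac{1}{hk}\right) + \frac12 - \frac{1}{2h} - \frac{1}{2k}.$$

Replace j by h − j and by k − j in these sums and use (26) to get

$$\frac16\left(\sigma(k, h, 0) + \frac{3(h - 1)}{h}\right) + \frac16\left(\sigma(h, k, 0) + \frac{3(k - 1)}{k}\right) = \frac16\left(\frac{h}{k} + \frac{k}{h} + \frac{1}{hk}\right) + \frac12 - \frac{1}{2h} - \frac{1}{2k},$$

which is equivalent to the desired result. |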

Lemma B gives us an explicit function f(h, k, c) such that

$$\sigma(h, k, c) = f(h, k, c) - \sigma(k, h, c) \tag{30}$$

whenever $0 < h \le k$, $0 \le c < k$, and h is relatively prime to k. From the
definition (16) it is clear that

$$\sigma(k, h, c) = \sigma(k \bmod h, h, c \bmod h). \tag{31}$$

Therefore we can use (30) iteratively to evaluate $\sigma(h, k, c)$, using a process that
reduces the parameters as in Euclid’s algorithm.
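
The iteration can be sketched in Python. The sketch takes f(h, k, c) to be the right-hand side of the reciprocity law, $h/k + k/h + 1/hk + 6c^2/hk - 6\lfloor c/h \rfloor - 3e(h,c)$, and stops when the first parameter reaches zero, where the sum is empty of nonzero terms. The function names are hypothetical; a direct summation is included only for comparison.

```python
from fractions import Fraction
from math import floor


def e(h, c):
    return 1 if (c == 0 or c % h != 0) else 0


def f(h, k, c):
    """Right-hand side of the reciprocity law for sigma(h,k,c) + sigma(k,h,c)."""
    return (Fraction(h, k) + Fraction(k, h) + Fraction(1 + 6 * c * c, h * k)
            - 6 * (c // h) - 3 * e(h, c))


def dedekind_sum_fast(h, k, c):
    """sigma(h,k,c) for 0 < h <= k, 0 <= c < k, gcd(h,k) = 1, using
    sigma(h,k,c) = f(h,k,c) - sigma(k mod h, h, c mod h) repeatedly."""
    total, sign = Fraction(0), 1
    while h > 0:
        total += sign * f(h, k, c)
        h, k, c = k % h, h, c % h
        sign = -sign
    return total            # the final sigma(0, 1, 0) contributes 0


def dedekind_sum_direct(h, k, c):
    """Definition-based evaluation, for checking small cases."""
    def saw(z):
        return Fraction(0) if z == floor(z) else z - floor(z) - Fraction(1, 2)
    return 12 * sum(saw(Fraction(j, k)) * saw(Fraction(h * j + c, k))
                    for j in range(k))
```

The loop runs once per step of Euclid's algorithm on (h, k), so the cost is logarithmic in k rather than linear.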

Further simplifications occur when we examine this iterative procedure more
closely. Let us set $m_1 = k$, $m_2 = h$, $c_1 = c$, and form the following tableau:

$$\begin{array}{ll}
m_1 = a_1 m_2 + m_3 \qquad & c_1 = b_1 m_2 + c_2 \\
m_2 = a_2 m_3 + m_4 & c_2 = b_2 m_3 + c_3 \\
m_3 = a_3 m_4 + m_5 & c_3 = b_3 m_4 + c_4 \\
m_4 = a_4 m_5 & c_4 = b_4 m_5 + c_5
\end{array} \tag{32}$$

Here

$$\begin{array}{ll}
a_j = \lfloor m_j/m_{j+1} \rfloor, \qquad & b_j = \lfloor c_j/m_{j+1} \rfloor, \\
m_{j+2} = m_j \bmod m_{j+1}, & c_{j+1} = c_j \bmod m_{j+1},
\end{array} \tag{33}$$

and it follows that

$$0 \le m_{j+1} < m_j, \qquad 0 \le c_j < m_j. \tag{34}$$

We have assumed for convenience that Euclid’s algorithm terminates in (32) after
four iterations; this assumption will reveal the pattern that holds in the general
case. Since h and k were relatively prime to start with, we must have $m_5 = 1$
and $c_5 = 0$ in (32).

Let us further assume that $c_3 \ne 0$ but $c_4 = 0$, in order to get a feeling for
the effect this has on the recurrence. Equations (30) and (31) yield

$$\begin{aligned}
\sigma(h, k, c) &= \sigma(m_2, m_1, c_1) \\
&= f(m_2, m_1, c_1) - \sigma(m_3, m_2, c_2) \\
&= \cdots \\
&= f(m_2, m_1, c_1) - f(m_3, m_2, c_2) + f(m_4, m_3, c_3) - f(m_5, m_4, c_4).
\end{aligned}$$

The first part “$h/k + k/h$” of the formula for f(h, k, c) in (19) contributes

$$\frac{m_2}{m_1} + \frac{m_1}{m_2} - \frac{m_3}{m_2} - \frac{m_2}{m_3} + \frac{m_4}{m_3} + \frac{m_3}{m_4} - \frac{m_5}{m_4} - \frac{m_4}{m_5}$$

to the total, and this simplifies to

$$\frac{h}{k} + \left(a_1 + \frac{m_3}{m_2}\right) - \frac{m_3}{m_2} - \left(a_2 + \frac{m_4}{m_3}\right) + \frac{m_4}{m_3} + \left(a_3 + \frac{m_5}{m_4}\right) - \frac{m_5}{m_4} - a_4 = \frac{h}{k} + a_1 - a_2 + a_3 - a_4.$$

The next part “$1/hk$” of (19) also leads to a simple contribution; according to
Eq. 4.5.3-9 and other formulas in Section 4.5.3, we have

$$\frac{1}{m_1 m_2} - \frac{1}{m_2 m_3} + \frac{1}{m_3 m_4} - \frac{1}{m_4 m_5} = \frac{h'}{k} - 1, \tag{35}$$

where h' is the unique integer satisfying

$$h'h \equiv 1 \pmod{k}, \qquad 0 < h' \le k. \tag{36}$$

Adding up all the contributions, and remembering our assumption that $c_4 = 0$
(so that $e(m_4, c_3) = 0$, cf. (20)), we find that

$$\sigma(h, k, c) = \frac{h + h'}{k} + (a_1 - a_2 + a_3 - a_4) - 6(b_1 - b_2 + b_3 - b_4) + 6\left(\frac{c_1^2}{m_1 m_2} - \frac{c_2^2}{m_2 m_3} + \frac{c_3^2}{m_3 m_4} - \frac{c_4^2}{m_4 m_5}\right) + 2,$$

in terms of the assumed tableau (32). Similar results hold in general:
Theorem D. Let h, k, c be integers with $0 < h \le k$, $0 \le c < k$, and h
relatively prime to k. Form the “Euclidean tableau” as defined in (33) above,
and assume that the process stops after t steps with $m_{t+1} = 1$. Let s be the
smallest subscript such that $c_s = 0$, and let h' be defined by (36). Then

$$\sigma(h, k, c) = \frac{h + h'}{k} + \sum_{1 \le j \le t} (-1)^{j+1} \left(a_j - 6b_j + 6\frac{c_j^2}{m_j m_{j+1}}\right) + 3\bigl((-1)^s + \delta_{s1}\bigr) - 2 + (-1)^t. \quad |$$


Euclid’s algorithm is analyzed carefully in Section 4.5.3; the quantities $a_1$,
$a_2$, ..., $a_t$ are called the partial quotients of h/k. Theorem 4.5.3F tells us that
the number of iterations, t, will never exceed $\log_\phi k$; hence Dedekind sums can
be evaluated rapidly. The terms $c_j^2/m_j m_{j+1}$ can be simplified further, and an
efficient algorithm for evaluating $\sigma(h, k, c)$ appears in exercise 17.
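
Theorem D translates almost line by line into code. The following Python sketch is not the algorithm of exercise 17; it simply builds the Euclidean tableau and adds up the terms, reading the theorem as $\sigma(h,k,c) = (h + h')/k + \sum_{1 \le j \le t} (-1)^{j+1}(a_j - 6b_j + 6c_j^2/m_j m_{j+1}) + 3((-1)^s + \delta_{s1}) - 2 + (-1)^t$. The function names are hypothetical, and `pow(h, -1, k)` requires Python 3.8 or later.

```python
from fractions import Fraction
from math import gcd, floor


def dedekind_theorem_d(h, k, c):
    """sigma(h, k, c) from the Euclidean tableau, for 0 < h <= k,
    0 <= c < k, and gcd(h, k) = 1."""
    if not (0 < h <= k and 0 <= c < k and gcd(h, k) == 1):
        raise ValueError("hypotheses of Theorem D are not satisfied")
    m = [None, k, h]          # 1-based: m[1] = k, m[2] = h
    cs = [None, c]            # cs[1] = c
    a, b = [None], [None]
    j = 1
    while True:
        a.append(m[j] // m[j + 1])
        m.append(m[j] % m[j + 1])
        b.append(cs[j] // m[j + 1])
        cs.append(cs[j] % m[j + 1])
        if m[j + 1] == 1:     # the process stops with m[t+1] = 1
            break
        j += 1
    t = j
    s = min(i for i in range(1, t + 2) if cs[i] == 0)
    h_inv = 1 if k == 1 else pow(h, -1, k)     # h' with h'h = 1 (mod k), 0 < h' <= k
    total = Fraction(h + h_inv, k)
    for j in range(1, t + 1):
        term = a[j] - 6 * b[j] + Fraction(6 * cs[j] ** 2, m[j] * m[j + 1])
        total += (-1) ** (j + 1) * term
    total += 3 * ((-1) ** s + (1 if s == 1 else 0)) - 2 + (-1) ** t
    return total


def dedekind_sum_direct(h, k, c):
    """Definition-based evaluation, for checking small cases."""
    def saw(z):
        return Fraction(0) if z == floor(z) else z - floor(z) - Fraction(1, 2)
    return 12 * sum(saw(Fraction(j, k)) * saw(Fraction(h * j + c, k))
                    for j in range(k))
```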

Now that we have analyzed generalized Dedekind sums, let us apply our 
knowledge to the determination of serial correlation coefficients. 

Example 1. Find the serial correlation when $m = 2^{35}$, $a = 2^{34} + 1$, c = 1.
Solution. We have

$$C = \bigl(2^{35}\sigma(2^{34} + 1, 2^{35}, 1) - 3 + 6(2^{35} - (2^{34} - 1) - 1)\bigr)/(2^{70} - 1)$$

by Eq. (17). To evaluate $\sigma(2^{34} + 1, 2^{35}, 1)$, we can form the tableau

$$\begin{array}{llll}
m_1 = 2^{35} & & c_1 = 1 & \\
m_2 = 2^{34} + 1 & a_1 = 1 & c_2 = 1 & b_1 = 0 \\
m_3 = 2^{34} - 1 & a_2 = 1 & c_3 = 1 & b_2 = 0 \\
m_4 = 2 & a_3 = 2^{33} - 1 & c_4 = 1 & b_3 = 0 \\
m_5 = 1 & a_4 = 2 & c_5 = 0 & b_4 = 1
\end{array}$$

Since $h' = 2^{34} + 1$, the value according to Theorem D comes to $2^{33} - 3 + 2^{-32}$.
Thus

$$C = (2^{68} + 5)/(2^{70} - 1) = \tfrac14 + \epsilon, \qquad |\epsilon| < 2^{-67}. \tag{37}$$

Such a correlation is much, much too high for randomness. Of course, this
generator has very low potency, and we have already rejected it as nonrandom.

Example 2. Find the approximate serial correlation when $m = 10^{10}$, a = 10001,
c = 2113248653.

Solution. We have $C \approx \sigma(a, m, c)/m$, and the computation proceeds as follows:

$$\begin{array}{llll}
m_1 = 10000000000 & & c_1 = 2113248653 & \\
m_2 = 10001 & a_1 = 999900 & c_2 = 7350 & b_1 = 211303 \\
m_3 = 100 & a_2 = 100 & c_3 = 50 & b_2 = 73 \\
m_4 = 1 & a_3 = 100 & c_4 = 0 & b_3 = 50
\end{array}$$

$$\sigma(m_2, m_1, c_1) = -31.6926653544; \qquad C \approx -3 \cdot 10^{-9}. \tag{38}$$

This is a very respectable value of C indeed. But the generator has a potency
of only 3, so it is not really a very good source of random numbers in spite of
the fact that it has low serial correlation. It is necessary to have a low serial
correlation, but not sufficient.
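
Both examples can be reproduced with exact rational arithmetic. The sketch below evaluates the Dedekind sum by repeated use of the reciprocity law, with f(h, k, c) taken to be $h/k + k/h + 1/hk + 6c^2/hk - 6\lfloor c/h \rfloor - 3e(h,c)$, and then forms C as $(m\sigma - 3 + 6(m - x' - c))/(m^2 - 1)$. These readings of (19) and (17), and all function names, are assumptions of this illustration; `pow(a, -1, m)` requires Python 3.8 or later.

```python
from fractions import Fraction


def dedekind_sum(h, k, c):
    """sigma(h,k,c) via sigma(h,k,c) = f(h,k,c) - sigma(k mod h, h, c mod h)."""
    total, sign = Fraction(0), 1
    while h > 0:
        e = 1 if (c == 0 or c % h != 0) else 0
        f = (Fraction(h, k) + Fraction(k, h) + Fraction(1 + 6 * c * c, h * k)
             - 6 * (c // h) - 3 * e)
        total += sign * f
        h, k, c = k % h, h, c % h
        sign = -sign
    return total


def full_period_serial_correlation(a, c, m):
    """Exact C over the whole period; assumes gcd(a, m) = 1 and 0 < c < m."""
    x_prime = (-c * pow(a, -1, m)) % m        # the x with (a*x + c) mod m == 0
    sigma = dedekind_sum(a, m, c)
    return (m * sigma - 3 + 6 * (m - x_prime - c)) / Fraction(m * m - 1)


# Example 1: m = 2^35, a = 2^34 + 1, c = 1
sigma1 = dedekind_sum(2**34 + 1, 2**35, 1)
c1 = full_period_serial_correlation(2**34 + 1, 1, 2**35)

# Example 2: m = 10^10, a = 10001, c = 2113248653
sigma2 = dedekind_sum(10001, 10**10, 2113248653)
c2 = full_period_serial_correlation(10001, 2113248653, 10**10)
```

Here `c1` is a `Fraction` very close to 1/4, and `float(c2)` is a few parts in $10^9$, negative, in line with the two examples.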

Example 3. Estimate the serial correlation for general a, m, and c.

Solution. If we consider just one application of (30), we have

$$\sigma(a, m, c) \approx \frac{m}{a} + 6\frac{c^2}{am} - 6\frac{c}{a} - \sigma(m, a, c).$$

Now $|\sigma(m, a, c)| < a$ by exercise 12, and therefore

$$C \approx \frac{\sigma(a, m, c)}{m} \approx \frac{1}{a}\left(1 - 6\frac{c}{m} + 6\left(\frac{c}{m}\right)^{2}\right). \tag{39}$$

The error in this approximation is less than $(a + 6)/m$ in absolute value.
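
To see how rough this estimate can be, it may be compared with the exact full-period value for a small generator. The sketch below reads (39) as $C \approx \frac{1}{a}(1 - 6(c/m) + 6(c/m)^2)$; the function names are hypothetical, and the exact value is obtained by brute-force summation, so only small m is practical.

```python
from fractions import Fraction


def estimate_39(a, c, m):
    """The estimate (1/a) * (1 - 6(c/m) + 6(c/m)^2)."""
    r = Fraction(c, m)
    return (1 - 6 * r + 6 * r * r) / a


def exact_serial_correlation(a, c, m):
    """C over the full period, from sums over all x in [0, m)."""
    sx = sum(range(m))
    sxx = sum(x * x for x in range(m))
    sxs = sum(x * ((a * x + c) % m) for x in range(m))
    return Fraction(m * sxs - sx * sx, m * sxx - sx * sx)
```

For m = 16, a = 5, c = 3 the estimate is 11/640 (about 0.017) while the exact value is 23/85 (about 0.27). The difference respects the stated bound $(a + 6)/m = 11/16$, but the bound is so loose for small m that the estimate says little about the true correlation.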

The estimate in (39) was the first theoretical result known about the randomness
of congruential generators. R. R. Coveyou [JACM 7 (1960), 72-74]
obtained it by averaging over all real numbers x between 0 and m instead of considering
only the integer values (cf. exercise 21); then Martin Greenberger [Math.
Comp. 15 (1961), 383-389] gave a rigorous derivation including an estimate of
the error term.

So began one of the saddest chapters in the history of computer science!
Although the above approximation is quite correct, it has been grievously misapplied
in practice; people abandoned the perfectly good generators they had
been using and replaced them by terrible generators that looked good from the
standpoint of (39). For more than a decade, the most common random number
generators in daily use were seriously deficient, solely because of a theoretical
advance. A little knowledge is a dangerous thing.

If we are to learn by past mistakes, we had better look carefully at how (39) 
has been misused. In the first place people assumed uncritically that a small 
serial correlation over the whole period would be a pretty good guarantee of 
randomness; but in fact it doesn’t even ensure a small serial correlation for 1000 
consecutive elements of the sequence (see exercise 14). 

Secondly, (39) and its error term will ensure a relatively small value of C only
when $a \approx \sqrt{m}$; therefore people suggested choosing multipliers near $\sqrt{m}$. In fact,
we shall see that nearly all multipliers give a value of C that is substantially less
than $1/\sqrt{m}$, hence (39) is not a very good approximation to the true behavior.
Minimizing a crude upper bound for C does not minimize C.

In the third place, people observed that (39) yields its best estimate when
$c/m \approx \frac12 \pm \frac16\sqrt3$, since these values are the roots of $1 - 6x + 6x^2 = 0$. “In the
absence of any other criterion for choosing c, we might as well use this one.” The
latter statement is not incorrect, but it is misleading at best, since experience has
shown that the value of c has hardly any influence on the true value of the serial
correlation when a is a good multiplier; the choice $c/m \approx \frac12 \pm \frac16\sqrt3$ reduces C
substantially only in cases like Example 2 above. And we are fooling ourselves
in such cases, since the bad multiplier will reveal its deficiencies in other ways.

Clearly we need a better estimate than (39); and such an estimate is now 
available thanks to Theorem D, which stems principally from the work of U. 
Dieter [Math. Comp. 25 (1971), 855-883]. Theorem D implies that $\sigma(a, m, c)$
will be small if the partial quotients of a/m are small. Indeed, by analyzing 
generalized Dedekind sums still more closely, it is possible to obtain quite a 
sharp estimate: 

Theorem K. Under the assumptions of Theorem D, we always have 

X a i~ X a i+\<<y{Kk,c)< X a i+l X ( 40 ) 

* 1 <3<t 1 <j<t Z 1 <j<t * l<j<t 

j odd j even j odd j even 

Proof. See D. E. Knuth, Acta Arithmetica 33 (1978), 297-325, where it is shown 
further that these bounds are essentially the best possible when there are large 
partial quotients. | 

Example 4. Estimate the serial correlation for a = 3141592621, $m = 2^{35}$,
c odd.

Solution. The partial quotients of a/m are 10, 1, 14, 1, 7, 1, 1, 1, 3, 3, 3, 5, 2,
1, 8, 7, 1, 4, 1, 2, 4, 2; hence by Theorem K

$$-45 \le \sigma(a, m, c) \le 68,$$

and the serial correlation is guaranteed to be extremely low for all c. 
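
The partial quotients quoted in Example 4 come from running Euclid's algorithm on the pair (m, a). A minimal Python sketch follows; the function name `partial_quotients` is an assumption made here for illustration. It also forms the two sums over odd and even indices j that appear under the summation signs of Theorem K.

```python
def partial_quotients(a, m):
    """Quotients a_1, a_2, ..., a_t produced by Euclid's algorithm on (m, a)."""
    quotients = []
    big, small = m, a
    while small:
        quotients.append(big // small)
        big, small = small, big % small
    return quotients


pq = partial_quotients(3141592621, 2**35)
odd_sum = sum(pq[0::2])    # a_1 + a_3 + a_5 + ...
even_sum = sum(pq[1::2])   # a_2 + a_4 + a_6 + ...
print(pq)
print(odd_sum, even_sum)
```

The list printed is exactly the sequence 10, 1, 14, ..., 4, 2 given in the solution; the odd-indexed quotients sum to 54 and the even-indexed ones to 28.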

Note that this bound is considerably better than we could obtain from (39), 
since the error in (39) is of order a/m; our “random” multiplier has turned out 
to be much better than one specifically chosen to look good on the basis of (39). 
In fact, it is possible to show that the average value of $\sum_{1 \le j \le t} a_j$, taken over
all multipliers a relatively prime to m, is

$$\frac{6}{\pi^2}(\ln m)^2 + O\bigl((\log m)(\log\log m)^4\bigr)$$

(see exercise 4.5.3-35). Therefore the probability that a random multiplier has
large $\sum_{1 \le j \le t} a_j$, say larger than $(\log m)^{2+\epsilon}$ for some fixed $\epsilon > 0$, approaches
zero as $m \to \infty$. This substantiates the empirical evidence that almost all
linear congruential sequences have extremely low serial correlation over the entire 
period. 

The exercises below show that other a priori tests, such as the serial test over 
the entire period, can also be expressed in terms of a few generalized Dedekind 
sums. It follows from Theorem K that linear congruential sequences will pass 
these tests provided that certain specified fractions (depending on a and m but
not on c) have small partial quotients. In particular, the result of exercise 19
implies that the serial test on pairs will be satisfactorily passed if and only if 
a/m has no large partial quotients. 

The book Dedekind Sums by Hans Rademacher and Emil Grosswald (Math. 
Assoc. of America, Carus Monograph No. 16, 1972) discusses the history and
properties of Dedekind sums and their generalizations. Further theoretical tests, 
including the serial test in higher dimensions, are discussed in Section 3.3.4. 


EXERCISES—First Set 


1. [M10] Express “x mod y” in terms of the sawtooth and $\delta$ functions.

2. [M20] Prove the “replicative law,” Eq. (10). 

3. [HM22] What is the Fourier series expansion (in terms of sines and cosines) of
the function f(x) = ((x))? 

► 4. [M19] If $m = 10^{10}$, what is the highest possible value of d (in the notation of
Theorem P), given that the potency of the generator is 10? 

5. [M21] Carry out the derivation of Eq. (17). 

6. [M27] Let $hh' + kk' = 1$. (a) Show, without using Lemma B, that

$$\sigma(h, k, c) = \sigma(h, k, 0) + 12 \sum_{0 < j < c} \Bigl(\!\Bigl(\frac{h'j}{k}\Bigr)\!\Bigr) + 6 \Bigl(\!\Bigl(\frac{h'c}{k}\Bigr)\!\Bigr)$$

for all integers c > 0. (b) Show that if 0 < j < k, 



(c) Under the assumptions of Lemma B, prove Eq. (21). 

► 7. [M24] Give a proof of the reciprocity law (19), when c = 0, by using the general
reciprocity law of exercise 1.2.4-45. 

► 8. [M34] (L. Carlitz.) Let

$$\rho(p, q, r) = 12 \sum_{0 \le j < r} \Bigl(\!\Bigl(\frac{jp}{r}\Bigr)\!\Bigr) \Bigl(\!\Bigl(\frac{jq}{r}\Bigr)\!\Bigr).$$

By generalizing the method of proof used in Lemma B, prove the following beautiful
identity due to H. Rademacher: If each of p, q, r is relatively prime to the other two,

$$\rho(p, q, r) + \rho(q, r, p) + \rho(r, p, q) = \frac{p}{qr} + \frac{q}{rp} + \frac{r}{pq} - 3.$$

(The reciprocity law for Dedekind sums, with c = 0, is the special case r = 1.)

9. [M40] Is there a simple proof of Rademacher’s identity (exercise 8) along the lines
of the proof in exercise 7 of a special case?

10. [M20] Show that when 0 < h < k it is possible to express $\sigma(k - h, k, c)$ and
$\sigma(h, k, -c)$ easily in terms of $\sigma(h, k, c)$.

11. [M30] The formulas given in the text show us how to evaluate $\sigma(h, k, c)$ when h
and k are relatively prime and c is an integer. For the general case, prove that

a) $\sigma(dh, dk, dc) = \sigma(h, k, c)$, integer d > 0;

b) $\sigma(h, k, c + \theta) = \sigma(h, k, c) + 6((h'c/k))$, integer c, real $0 < \theta < 1$, when h and k
are relatively prime and $hh' \equiv 1$ (modulo k).

12. [M24] Show that if h is relatively prime to k and c is an integer, $|\sigma(h, k, c)| \le (k - 1)(k - 2)/k$.

13. [M24] Generalize Eq. (26) so that it gives an expression for $\sigma(h, k, c)$.

► 14 . [M20] The linear congruential generator that has m — 2 , a = 2 +1, c — 1, 

was given the serial correlation test on three batches of 1000 consecutive numbers, and 
the result was a very high correlation, between 0.2 and 0.3, in each case. What is the 
serial correlation of this generator, taken over all $2^{35}$ numbers of the period?

15. [M21] Generalize Lemma B so that it applies to all real values of c, 0 < c < k.

16. [M24] Given the Euclidean tableau defined in (33), let $p_0 = 1$, $p_1 = a_1$, and
$p_j = a_j p_{j-1} + p_{j-2}$ for $1 < j \le t$. Show that the complicated portion of the
sum in Theorem D can be rewritten as follows, making it possible to avoid noninteger
computations:

$$\sum_{1 \le j \le t} (-1)^{j+1} \frac{c_j^2}{m_j m_{j+1}} = \frac{1}{m_1} \sum_{1 \le j \le t} (-1)^{j+1} b_j (c_j + c_{j+1}) p_{j-1}.$$

[Hint: Prove that we have $\sum_{1 \le j \le r} (-1)^{j+1}/m_j m_{j+1} = (-1)^{r+1} p_{r-1}/m_1 m_{r+1}$ for
$1 \le r \le t$.]

17. [M22] Design an algorithm that evaluates $\sigma(h, k, c)$ for integers h, k, c satisfying
the hypotheses of Theorem D. Your algorithm should use only integer arithmetic (of
unlimited precision), and it should produce the answer in the form A + B/k where A
and B are integers. (Cf. exercise 16.) If possible, use only a finite number of variables
for temporary storage, instead of maintaining arrays such as $a_1, a_2, \ldots, a_t$.

► 18. [M23] (U. Dieter.) Given positive integers h, k, z, let



Show that this sum can be evaluated in “closed form,” in terms of generalized Dedekind
sums and the sawtooth function. [Hint: When z < k, the quantity $\lfloor j/k \rfloor - \lfloor (j - z)/k \rfloor$
equals 1 for $0 \le j < z$, and it equals 0 for $z \le j < k$, so we can introduce this factor
and sum over $0 \le j < k$.]

► 19. [M23] Show that the serial test can be analyzed over the full period, in terms of
generalized Dedekind sums, by finding a formula for the probability that $\alpha \le X_n < \beta$
and $\alpha' \le X_{n+1} < \beta'$ when $\alpha$, $\beta$, $\alpha'$, $\beta'$ are given integers with $0 \le \alpha < \beta \le m$,
$0 \le \alpha' < \beta' \le m$. [Hint: Consider the quantity $\lfloor (x - \alpha)/m \rfloor - \lfloor (x - \beta)/m \rfloor$.]

20. [M29] (U. Dieter.) Extend Theorem P by obtaining a formula for the probability
that $X_n > X_{n+1} > X_{n+2}$, in terms of generalized Dedekind sums.

EXERCISES—Second Set


In many cases, exact computations with integers are quite difficult to carry out, but 
we can attempt to study the probabilities that arise when we take the average over all 
real values of x instead of restricting the calculation to integer values. Although these 
results are only approximate, they shed some light on the subject. 

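It is convenient to deal with numbers $U_n$ between zero and one; for linear congruential
sequences, $U_n = X_n/m$, and we have $U_{n+1} = \{aU_n + \theta\}$, where $\theta = c/m$
and $\{x\}$ denotes x mod 1. For example, the formula for serial correlation now becomes

$$C = \left(\int_0^1 x\{ax + \theta\}\,dx - \Bigl(\int_0^1 x\,dx\Bigr)^{2}\right) \Bigg/ \left(\int_0^1 x^2\,dx - \Bigl(\int_0^1 x\,dx\Bigr)^{2}\right).$$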


► 21. [HM28] (R. R. Coveyou.) What is the value of C in the formula just given?

► 22. [M22] Let a be an integer, and let $0 \le \theta < 1$. If x is a real number between 0
and 1, and if $s(x) = \{ax + \theta\}$, what is the probability that s(x) < x? (This is the
“real number” analog of Theorem P.)

23. [28] The previous exercise gives the probability that $U_{n+1} < U_n$. What is the
probability that $U_{n+2} < U_{n+1} < U_n$, assuming that $U_n$ is a random real number
between zero and one?

24. [M29] Under the assumptions of the preceding problem, except with $\theta = 0$, show
that $U_n > U_{n+1} > \cdots > U_{n+t-1}$ occurs with probability


1 

t\ 




What is the average length of a descending run starting at U n , assuming that U n is 
selected at random between zero and one? 

► 25. [M25] Let $\alpha$, $\beta$, $\alpha'$, $\beta'$ be real numbers with $0 \le \alpha < \beta \le 1$, $0 \le \alpha' < \beta' \le 1$.
Under the assumptions of exercise 22, what is the probability that $\alpha \le x < \beta$ and
$\alpha' \le s(x) < \beta'$? (This is the “real number” analog of exercise 19.)

26. [M21] Consider a “Fibonacci” generator, where $U_{n+1} = \{U_n + U_{n-1}\}$. Assuming
that $U_1$ and $U_2$ are independently chosen at random between 0 and 1, find the probability
that $U_1 < U_2 < U_3$, $U_1 < U_3 < U_2$, $U_2 < U_1 < U_3$, etc. [Hint: Divide
the “unit square,” i.e., the points of the plane $\{(x, y) \mid 0 \le x, y < 1\}$, into six parts,
depending on the relative order of x, y, and $\{x + y\}$, and determine the area of each
part.]

27. [M32] In the Fibonacci generator of the preceding exercise, let $U_0$ and $U_1$ be
chosen independently in the unit square except that $U_0 > U_1$. Determine the probability
that $U_1$ is the beginning of an upward run of length k, so that $U_0 > U_1 < \cdots < U_k > U_{k+1}$.
Compare this with the corresponding probabilities for a random sequence.

28 . [MSS] According to Eq. 3.2.1.3-5, a linear congruential generator with potency 2 
satisfies the condition $X_{n-1} - 2X_n + X_{n+1} \equiv (a - 1)c$ (modulo m). Consider a
generator that abstracts this situation: let $U_{n+1} = \{\alpha + 2U_n - U_{n-1}\}$. As in
exercise 26, divide the unit square into parts that show the relative order of $U_1$, $U_2$,
and $U_3$ for each pair $(U_1, U_2)$. Are there any values of $\alpha$ for which all six possible orders
are achieved with probability $\frac16$, assuming that $U_1$ and $U_2$ are chosen at random in the
unit square?






3.3.4. The Spectral Test 

In this section we shall study an especially important way to check the quality of 
linear congruential random number generators; not only do all good generators 
pass this test, all generators now known to be bad actually fail it. Thus it is 
by far the most powerful test known, and it deserves particular attention. Our 
discussion will also bring out some fundamental limitations on the degree of randomness
we can expect from linear congruential sequences and their generalizations.

The spectral test embodies aspects of both the empirical and theoretical 
tests studied in previous sections: it is like the theoretical tests because it deals 
with properties of the full period of the sequence, and it is like the empirical 
tests because it requires a computer program to determine the results. 

A. Ideas underlying the test. The most important randomness criteria seem 
to rely on properties of the joint distribution of t consecutive elements of the 
sequence, and the spectral test deals directly with this distribution. If we have 
a sequence $\langle U_n \rangle$ of period m, the idea is to analyze the set of all m points

$$\{\, (U_n, U_{n+1}, \ldots, U_{n+t-1}) \,\}$$ (1)

in t-dimensional space.

For simplicity we shall assume that we have a linear congruential sequence
$(X_0, a, c, m)$ of maximum period length m (so that $c \ne 0$), or that m is prime
and c = 0 and the period length is m - 1. In the latter case we shall add the
point $(0, 0, \ldots, 0)$ to the set (1), so that there are always m points in all; this
extra point has a negligible effect when m is large, and it makes the theory much
simpler. Under these assumptions, (1) can be rewritten as

$$\Bigl\{\, \frac{1}{m}\bigl(x, s(x), s(s(x)), \ldots, s^{t-1}(x)\bigr) \Bigm| 0 \le x < m \,\Bigr\},$$ (2)

where

$$s(x) = (ax + c) \bmod m$$ (3)

is the “successor” of x. Note that we are considering only the set of all such points
in t dimensions, not the order in which those points are actually generated. But
the order of generation is reflected in the dependence between components of the
vectors; and the spectral test studies such dependence for various dimensions t
by dealing with the totality of all points (2).

For example, Fig. 8 shows a typical small case in 2 and 3 dimensions, for 
the generator with 

s(x) = (137x + 187) mod 256. (4) 

Of course a generator with period length 256 will hardly be random, but 256 is
small enough that we can draw the diagram and gain some understanding before
we turn to the larger m’s that are of practical interest.

Fig. 8. (a) The two-dimensional grid formed by all pairs of successive points
$(X_n, X_{n+1})$, when $X_{n+1} = (137X_n + 187) \bmod 256$. (b) The three-dimensional
grid of triplets $(X_n, X_{n+1}, X_{n+2})$. [Illustrations courtesy of Bruce G. Baumgart.]


Perhaps the most striking thing about the pattern of boxes in Fig. 8 is that 
we can cover them all by a fairly small number of parallel lines; indeed, there are 
many different families of parallel lines that will hit all the points. For example, 
a set of 20 nearly vertical lines will do the job, as will a set of 21 lines that tilt
upward at roughly a 30° angle. We commonly observe similar patterns when
driving past farmlands that have been planted in a systematic manner. 
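
The families of parallel lines mentioned for Fig. 8 can be checked by direct enumeration. The Python sketch below is illustrative only. The normal vectors (18, -2), (7, -15) and (11, 13) are values worked out for this sketch from the slopes and spacings quoted later in this section ($1/\sqrt{328}$, $1/\sqrt{274}$, $1/\sqrt{290}$); the text does not list them explicitly, and the function name `covering_lines` is hypothetical.

```python
M, A, C = 256, 137, 187   # the generator s(x) = (137x + 187) mod 256 of (4)

POINTS = [(x, (A * x + C) % M) for x in range(M)]


def covering_lines(u1, u2):
    """Indices q of the lines u1*x + u2*y = u2*C + M*q that contain a point.

    Raises ValueError if some point lies on none of these lines, i.e. if
    (u1, u2) does not define a covering family.
    """
    hit = set()
    for x, y in POINTS:
        value = u1 * x + u2 * y - u2 * C
        if value % M != 0:
            raise ValueError("family does not cover all points")
        hit.add(value // M)
    return hit


for normal in [(18, -2), (7, -15), (11, 13)]:
    lines = covering_lines(*normal)
    spacing_sq = normal[0] ** 2 + normal[1] ** 2   # lines are 1/sqrt(...) apart
    print(normal, len(lines), spacing_sq)
```

Under these assumptions the three families use 20, 21 and 24 lines respectively, with squared normal lengths 328, 274 and 290.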

If the same generator is considered in three dimensions, we obtain 256 points 
in a cube, obtained by appending a “height” component s(s(x)) to each of the 
256 points (x, s(x)) in the plane of Fig. 8(a), as shown in Fig. 8(b). Let’s imagine
that this 3-D crystal structure has been made into a physical model, a cube that 
we can turn in our hands; as we rotate it, we will notice various families of 
parallel planes that encompass all of the points. In the words of Wallace Givens, 
the random numbers stay “mainly in the planes.” 

At first glance we might think that such systematic behavior is so nonrandom 
as to make congruential generators quite worthless; but more careful reflection, 
remembering that m is quite large in practice, provides a better insight. The 
regular structure in Fig. 8 is essentially the “grain” we see when examining 
our random numbers under a high-power microscope. If we take truly random 
numbers between 0 and 1, and round or truncate them to finite accuracy so 
that each is an integer multiple of $1/\nu$ for some given number $\nu$, then the
t-dimensional points (1) we obtain will have an extremely regular character when
viewed through a microscope.

Let $1/\nu_2$ be the maximum distance between lines, taken over all families
of parallel straight lines that cover the points $\{(x/m, s(x)/m)\}$ in two dimensions.
We shall call $\nu_2$ the two-dimensional accuracy of the random number
generator, since the pairs of successive numbers have a fine structure that is
essentially good to one part in $\nu_2$. Similarly, let $1/\nu_3$ be the maximum distance
between planes, taken over all families of parallel planes that cover all points
$\{(x/m, s(x)/m, s(s(x))/m)\}$; we shall call $\nu_3$ the accuracy in three dimensions.
The t-dimensional accuracy $\nu_t$ is the reciprocal of the maximum distance between
hyperplanes, taken over all families of parallel $(t-1)$-dimensional hyperplanes
that cover all points $\{(x/m, s(x)/m, \ldots, s^{t-1}(x)/m)\}$.

The essential difference between periodic sequences and truly random sequences
that have been truncated to multiples of $1/\nu$ is that the “accuracy”
of truly random sequences is the same in all dimensions, while the “accuracy”
of periodic sequences decreases as t increases. Indeed, since there are only m
points in the t-dimensional cube when m is the period length, we can’t achieve
a t-dimensional accuracy of more than about $m^{1/t}$.

When the independence of t consecutive values is considered, computer-generated
random numbers will behave essentially as if we took truly random
numbers and truncated them to $\lg \nu_t$ bits, where $\nu_t$ decreases with increasing t.
In practice, such varying accuracy is usually all we need. We don’t insist that the
10-dimensional accuracy be $2^{35}$, in the sense that all $(2^{35})^{10}$ possible 10-tuples
$(U_n, U_{n+1}, \ldots, U_{n+9})$ should be equally likely on a 35-bit machine; for such large
values of t we want only a few of the leading bits of $(U_n, U_{n+1}, \ldots, U_{n+t-1})$ to
behave as if they were independently random.

On the other hand when an application demands high resolution of the 
random number sequence, simple linear congruential sequences will necessarily be 
inadequate; a generator with larger period should be used instead, even though 
only a small fraction of the period will actually be generated. Squaring the period 
will essentially square the accuracy in higher dimensions, i.e., it will double the 
effective number of bits of precision. 

The spectral test is based on the values of $\nu_t$ for small t, say $2 \le t \le 6$.
Dimensions 2, 3, and 4 seem to be adequate to detect important deficiencies in
a sequence, but since we are considering the entire period it seems best to be
somewhat cautious and go up into another dimension or two; on the other hand
the values of $\nu_t$ for t > 10 seem to be of no practical significance whatever.
(This is fortunate, because it appears to be rather difficult to calculate $\nu_t$ when
t > 10.)

Note that there is a vague relation between the spectral test and the serial 
test; for example, a special case of the serial test, taken over the entire period 
as in exercise 3.3.3-19, counts the number of boxes in each of 64 subsquares of
Fig. 8(a). The main difference is that the spectral test rotates the dots so as to
discover the least favorable orientation. We shall return to a consideration of 
the serial test later in this section. 

It may appear at first that we should apply the spectral test only for one
suitably high value of t; if a generator passes the test in three dimensions, it
seems plausible that it should also pass the 2-D test, hence we might as well omit
the latter. The fallacy in this reasoning occurs because we apply more stringent
conditions in lower dimensions. A similar situation occurs with the serial test:
Consider a generator that (quite properly) has almost the same number of points
in each subcube of the unit cube, when the unit cube has been divided into 64
subcubes of size $\frac14 \times \frac14 \times \frac14$; this same generator might yield completely empty
subsquares of the unit square, when the unit square has been divided into 64
subsquares of size $\frac18 \times \frac18$. Since we increase our expectations in lower dimensions,
a separate test for each dimension is required.

It is not always true that $\nu_t \le m^{1/t}$, although this upper bound is valid
when the points form a rectangular grid. For example, it turns out that $\nu_2 = \sqrt{274} > \sqrt{256}$
in Fig. 8, because a nearly hexagonal structure brings the m points
closer together than would be possible in a strictly rectangular arrangement.

In order to develop an algorithm that computes $\nu_t$ efficiently, we must look
more deeply at the associated mathematical theory. Therefore a reader who is 
not mathematically inclined is advised to skip to part D of this section, where 
the spectral test is presented as a “plug-in” method accompanied by several 
examples. On the other hand, we shall see that the mathematics behind the 
spectral test requires only some elementary manipulations of vectors. 

Some authors have suggested using the minimum number $N_t$ of parallel
covering lines or hyperplanes as the criterion, instead of the maximum distance
$1/\nu_t$ between them. However, this number does not appear to be as important
as the concept of accuracy defined above, because it is biased by how nearly
the slope of the lines or hyperplanes matches the coordinate axes of the cube.
For example, the 20 nearly vertical lines that cover all the points of Fig. 8 are
actually $1/\sqrt{328}$ units apart, and this might falsely imply an accuracy of one
part in $\sqrt{328}$, or perhaps even of one part in 20. The true accuracy of only one
part in $\sqrt{274}$ is realized only for the larger family of 21 lines with a slope of 7/15;
another family of 24 lines, with a slope of -11/13, also has a greater inter-line
distance than the 20-line family, since $1/\sqrt{290} > 1/\sqrt{328}$. The precise way in
which families of lines act at the boundaries of the unit hypercube does not seem
to be an especially “clean” or significant criterion; however, for those people who
prefer to count hyperplanes, it is possible to compute $N_t$ using a method quite
similar to the way in which we shall calculate $\nu_t$ (see exercise 16).

*B. Theory behind the test. In order to analyze the basic set (2), we start with
the observation that

$$\frac{1}{m}\,s^{j}(x) = \left(\frac{a^{j}x + (1 + a + \cdots + a^{j-1})c}{m}\right) \bmod 1.$$

We can get rid of the “mod 1” operation by extending the set periodically,
making infinitely many copies of the original t-dimensional hypercube, proceeding
in all directions. This gives us the set

$$\begin{aligned}
L &= \Bigl\{ \Bigl(\frac{x}{m} + k_1,\ \frac{s(x)}{m} + k_2,\ \ldots,\ \frac{s^{t-1}(x)}{m} + k_t\Bigr) \Bigm| \text{integer } x, k_1, k_2, \ldots, k_t \Bigr\} \\
  &= \Bigl\{ V_0 + \Bigl(\frac{x}{m} + k_1,\ \frac{ax}{m} + k_2,\ \ldots,\ \frac{a^{t-1}x}{m} + k_t\Bigr) \Bigm| \text{integer } x, k_1, k_2, \ldots, k_t \Bigr\},
\end{aligned}$$

where

$$V_0 = \frac{1}{m}\bigl(0,\ c,\ (1 + a)c,\ \ldots,\ (1 + a + \cdots + a^{t-2})c\bigr)$$ (6)

is a constant vector. The variable $k_1$ is redundant in this representation of L, because
we can change $(x, k_1, k_2, \ldots, k_t)$ to $(x + k_1m, 0, k_2 - ak_1, \ldots, k_t - a^{t-1}k_1)$,
reducing $k_1$ to zero without loss of generality. Therefore we obtain the comparatively
simple formula


$$L = \{\, V_0 + y_1V_1 + y_2V_2 + \cdots + y_tV_t \mid \text{integer } y_1, y_2, \ldots, y_t \,\},$$ (7)

where

$$V_1 = \frac{1}{m}(1, a, a^2, \ldots, a^{t-1});$$ (8)

$$V_2 = (0, 1, 0, \ldots, 0), \quad V_3 = (0, 0, 1, \ldots, 0), \quad \ldots, \quad V_t = (0, 0, 0, \ldots, 1).$$ (9)

The points $(x_1, x_2, \ldots, x_t)$ of L that satisfy $0 \le x_j < 1$ for all j are precisely
the m points of our original set (2).

Note that the increment c appears only in $V_0$, and the effect of $V_0$ is merely
to shift all elements of L without changing their relative distances; hence c does
not affect the spectral test in any way, and we might as well assume that
$V_0 = (0, 0, \ldots, 0)$ when we are calculating $\nu_t$. When $V_0$ is the zero vector we have a
so-called lattice of points

$$L_0 = \{\, y_1V_1 + y_2V_2 + \cdots + y_tV_t \mid \text{integer } y_1, y_2, \ldots, y_t \,\},$$ (10)

and our goal is to study the distances between adjacent $(t-1)$-dimensional
hyperplanes, in families of parallel hyperplanes that cover all the points of $L_0$.

A family of parallel $(t-1)$-dimensional hyperplanes can be defined by a
nonzero vector $U = (u_1, \ldots, u_t)$ that is perpendicular to all of them; and the set
of points on a particular hyperplane is then

$$\{\, (x_1, \ldots, x_t) \mid x_1u_1 + \cdots + x_tu_t = q \,\},$$ (11)

where q is a different constant for each hyperplane in the family. In other words,
each hyperplane is the set of all X for which the dot product $X \cdot U$ has a given
value q. In our case the hyperplanes are all separated by a fixed distance, and
one of them contains $(0, 0, \ldots, 0)$; hence we can adjust the magnitude of U so
that the set of all integer values q gives all the hyperplanes in the family. Then
the distance between neighboring hyperplanes is the minimum distance from
$(0, 0, \ldots, 0)$ to the hyperplane for q = 1, namely

$$\min_{\text{real } x_1, \ldots, x_t} \Bigl\{ \sqrt{x_1^2 + \cdots + x_t^2} \Bigm| x_1u_1 + \cdots + x_tu_t = 1 \Bigr\}.$$ (12)

Cauchy’s inequality (cf. exercise 1.2.3-30) tells us that

$$(x_1u_1 + \cdots + x_tu_t)^2 \le (x_1^2 + \cdots + x_t^2)(u_1^2 + \cdots + u_t^2);$$ (13)

hence the minimum in (12) occurs when each $x_j = u_j/(u_1^2 + \cdots + u_t^2)$; the
distance between neighboring hyperplanes is

$$1\big/\sqrt{u_1^2 + \cdots + u_t^2} = 1/\mathrm{length}(U).$$ (14)

In other words, the quantity $\nu_t$ we seek is precisely the length of the shortest
vector U that defines a family of hyperplanes $\{X \cdot U = q \mid \text{integer } q\}$ containing
all the elements of $L_0$.

Such a vector $U = (u_1, \ldots, u_t)$ must be nonzero, and it must satisfy $V \cdot U$ =
integer for all V in $L_0$. In particular, since the points $(1, 0, \ldots, 0)$, $(0, 1, \ldots, 0)$,
..., $(0, 0, \ldots, 1)$ are all in $L_0$, all of the $u_j$ must be integers. Furthermore since
$V_1$ is in $L_0$, we must have $\frac{1}{m}(u_1 + au_2 + \cdots + a^{t-1}u_t)$ = integer, i.e.,

$$u_1 + au_2 + \cdots + a^{t-1}u_t \equiv 0 \quad (\text{modulo } m).$$ (15)

Conversely, any nonzero integer vector $U = (u_1, \ldots, u_t)$ satisfying (15) defines a
family of hyperplanes with the required properties, since all of $L_0$ will be covered:
$(y_1V_1 + \cdots + y_tV_t) \cdot U$ will be an integer for all integers $y_1, \ldots, y_t$. We have
proved that

$$\begin{aligned}
\nu_t^2 &= \min_{(u_1, \ldots, u_t) \ne (0, \ldots, 0)} \bigl\{\, u_1^2 + u_2^2 + \cdots + u_t^2 \bigm| u_1 + au_2 + \cdots + a^{t-1}u_t \equiv 0 \ (\text{modulo } m) \,\bigr\} \\
&= \min_{(x_1, \ldots, x_t) \ne (0, \ldots, 0)} \bigl( (mx_1 - ax_2 - a^2x_3 - \cdots - a^{t-1}x_t)^2 + x_2^2 + x_3^2 + \cdots + x_t^2 \bigr).
\end{aligned}$$ (16)
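
For a toy modulus the first minimum in (16) can simply be enumerated. The following Python sketch is an illustration under that assumption (small m and t only); the function name `nu_squared_bruteforce` is hypothetical, and this is not the method developed in the rest of the section.

```python
from itertools import product


def nu_squared_bruteforce(a, m, t):
    """Minimum of u_1^2 + ... + u_t^2 over nonzero integer vectors with
    u_1 + a*u_2 + ... + a^(t-1)*u_t == 0 (modulo m); feasible for tiny m only."""
    powers = [pow(a, j, m) for j in range(1, t)]   # a, a^2, ..., a^(t-1) mod m
    best = m * m                                    # U = (m, 0, ..., 0) satisfies (15)
    for tail in product(range(-m, m + 1), repeat=t - 1):
        if not any(tail):
            continue        # then u_1 is a nonzero multiple of m: no improvement
        r = -sum(p * u for p, u in zip(powers, tail)) % m
        u1 = r if 2 * r <= m else r - m             # residue of least absolute value
        best = min(best, u1 * u1 + sum(u * u for u in tail))
    return best


print(nu_squared_bruteforce(137, 256, 2))   # two-dimensional case of Fig. 8
```

For the generator of Fig. 8 this prints 274, in agreement with the value $\nu_2 = \sqrt{274}$ quoted earlier. The loop visits $(2m + 1)^{t-1}$ candidates, which is exactly why such a search is hopeless for moduli of practical size.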

C. Deriving a computational method. We have now reduced the spectral test 
to the problem of finding the minimum value (16); but how on earth can we 
determine that minimum value in a reasonable amount of time? A brute-force 
search is out of the question, since m is very large in cases of practical interest. 

It will be interesting and probably more useful if we develop a computational
method for solving an even more general problem: Find the minimum value of
the quantity

$$f(x_1, \ldots, x_t) = (u_{11}x_1 + \cdots + u_{t1}x_t)^2 + \cdots + (u_{1t}x_1 + \cdots + u_{tt}x_t)^2$$ (17)

over all nonzero integer vectors $(x_1, \ldots, x_t)$, given any nonsingular matrix of
coefficients $U = (u_{ij})$. The expression (17) is called a “positive definite quadratic
form” in t variables. Since U is nonsingular, (17) cannot be zero unless the $x_j$
are all zero.

Let us write $U_1, \ldots, U_t$ for the rows of U. Then (17) may be written

$$f(x_1, \ldots, x_t) = (x_1U_1 + \cdots + x_tU_t) \cdot (x_1U_1 + \cdots + x_tU_t),$$ (18)

the square of the length of the vector $x_1U_1 + \cdots + x_tU_t$. The nonsingular matrix
U has an inverse, which means that we can find uniquely determined vectors
$V_1, \ldots, V_t$ such that

$$U_i \cdot V_j = \delta_{ij}, \qquad 1 \le i, j \le t.$$ (19)

For example, in the special form (16) that arises in the spectral test, we have

$$\begin{array}{ll}
U_1 = (m, 0, 0, \ldots, 0), & V_1 = \frac{1}{m}(1, a, a^2, \ldots, a^{t-1}), \\
U_2 = (-a, 1, 0, \ldots, 0), & V_2 = (0, 1, 0, \ldots, 0), \\
U_3 = (-a^2, 0, 1, \ldots, 0), & V_3 = (0, 0, 1, \ldots, 0), \\
\quad\vdots & \quad\vdots \\
U_t = (-a^{t-1}, 0, 0, \ldots, 1), & V_t = (0, 0, 0, \ldots, 1).
\end{array}$$ (20)

These $V_j$ are precisely the vectors (8), (9) that we used to define our original
lattice $L_0$. As the reader may well suspect, this is not a coincidence; indeed, if we
had begun with an arbitrary lattice $L_0$, defined by any set of linearly independent
vectors $V_1, \ldots, V_t$, the argument we have used above can be generalized to
show that the maximum separation between hyperplanes in a covering family
is equivalent to minimizing (17), where the coefficients $u_{ij}$ are defined by (19).
(See exercise 2.)

Our first step in minimizing (18) is to reduce it to a finite problem, i.e., to
show that we won’t need to test infinitely many vectors $(x_1, \ldots, x_t)$ to find the
minimum. This is where the vectors $V_1, \ldots, V_t$ come in handy; we have

$$x_k = (x_1U_1 + \cdots + x_tU_t) \cdot V_k,$$

and Cauchy’s inequality tells us that

$$\bigl((x_1U_1 + \cdots + x_tU_t) \cdot V_k\bigr)^2 \le f(x_1, \ldots, x_t)(V_k \cdot V_k).$$

Hence we have derived a useful upper bound on each coordinate $x_k$:

Lemma A. Let $(x_1, \ldots, x_t)$ be a nonzero vector that minimizes (18) and let
$(y_1, \ldots, y_t)$ be any nonzero integer vector. Then

$$x_k^2 \le (V_k \cdot V_k)\, f(y_1, \ldots, y_t), \qquad \text{for } 1 \le k \le t.$$ (21)

In particular, letting $y_i = \delta_{ij}$ for all i,

$$x_k^2 \le (V_k \cdot V_k)(U_j \cdot U_j), \qquad \text{for } 1 \le j, k \le t.$$ (22) |
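
Lemma A already suffices for a toy case. The Python sketch below is illustrative only: the helper names `spectral_rows` and `lemma_a_search` are assumptions made here. It builds the rows of (20), derives the coordinate limits from (22) using the shortest row $U_j$, and searches every integer vector inside those limits.

```python
from fractions import Fraction
from itertools import product
from math import isqrt


def dot(p, q):
    return sum(x * y for x, y in zip(p, q))


def spectral_rows(a, m, t):
    """Rows U_1..U_t and V_1..V_t of (20), satisfying U_i . V_j = delta_ij."""
    U = [[m] + [0] * (t - 1)]
    V = [[Fraction(a ** j, m) for j in range(t)]]
    for i in range(1, t):
        unit = [1 if j == i else 0 for j in range(t)]
        U.append([-(a ** i)] + unit[1:])
        V.append(unit)
    return U, V


def lemma_a_search(U, V):
    """Minimize f over the finite box given by (22); practical for tiny cases only."""
    t = len(U)
    f_unit = min(dot(row, row) for row in U)          # f(y) for the best unit vector y
    limits = []
    for k in range(t):
        bound = Fraction(dot(V[k], V[k])) * f_unit     # right-hand side of (22)
        limits.append(isqrt(bound.numerator // bound.denominator))
    best = f_unit
    for x in product(*(range(-lim, lim + 1) for lim in limits)):
        if any(x):
            vec = [sum(x[i] * U[i][c] for i in range(t)) for c in range(t)]
            best = min(best, dot(vec, vec))
    return best, limits


U, V = spectral_rows(137, 256, 2)
print(lemma_a_search(U, V))
```

For a = 137, m = 256, t = 2 the limits come out as $|x_1| \le 73$ and $|x_2| \le 137$, and the minimum found is 274. Even here the box contains about 40,000 vectors, which illustrates why the bound alone does not make an exhaustive search attractive.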


Lemma A reduces the problem to a finite search, but the right-hand side of 
(21) is usually much too large to make an exhaustive search feasible; we need at 
least one more idea. On such occasions, an old maxim provides sound advice: “If 
you can’t solve a problem as it is stated, change it into a simpler problem that
has the same answer.” For example, Euclid’s algorithm has this form; if we don’t
know the gcd of the input numbers, we change them into smaller numbers having 
the same gcd. (In fact, a slightly more general approach probably underlies the 
discovery of nearly all algorithms: “If you can’t solve a problem directly, change 
it into one or more simpler problems, from whose solution you can solve the 
original one.”) 

In our case, a simpler problem is one that requires less searching because the
right-hand side of (22) is smaller. The key idea we shall use is that it is possible
to change one quadratic form into another one that is equivalent for all practical
purposes. Let j be any fixed subscript, $1 \le j \le t$; let $(q_1, \ldots, q_{j-1}, q_{j+1}, \ldots, q_t)$
be any sequence of t - 1 integers; and consider the following transformation of
the vectors:

$$\begin{aligned}
V_i' &= V_i - q_iV_j, & x_i' &= x_i - q_ix_j, & U_i' &= U_i, \qquad \text{for } i \ne j; \\
V_j' &= V_j, & x_j' &= x_j, & U_j' &= U_j + \textstyle\sum_{i \ne j} q_iU_i.
\end{aligned}$$ (23)

It is easy to see that the new vectors $U_1', \ldots, U_t'$ define a quadratic form $f'$
for which $f'(x_1', \ldots, x_t') = f(x_1, \ldots, x_t)$; furthermore the basic orthogonality
condition (19) remains valid, because it is easy to check that $U_i' \cdot V_j' = \delta_{ij}$. As
$(x_1, \ldots, x_t)$ runs through all nonzero integer vectors, so does $(x_1', \ldots, x_t')$; hence
the new form $f'$ has the same minimum as f.
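
The two invariants claimed for transformation (23) are easy to confirm on a concrete case. In the Python sketch below, the function names, the multipliers q and the test vector x are arbitrary choices made for illustration; the rows are those of (20) with a = 137, m = 256, t = 3.

```python
from fractions import Fraction


def dot(p, q):
    return sum(x * y for x, y in zip(p, q))


def apply_23(U, V, x, j, q):
    """Apply transformation (23) with pivot index j (0-based); q[j] is ignored."""
    t = len(U)
    others = [i for i in range(t) if i != j]
    new_V = [row[:] for row in V]
    new_x = list(x)
    for i in others:
        new_V[i] = [vi - q[i] * vj for vi, vj in zip(V[i], V[j])]   # V_i' = V_i - q_i V_j
        new_x[i] = x[i] - q[i] * x[j]                                # x_i' = x_i - q_i x_j
    new_U = [row[:] for row in U]
    new_U[j] = [U[j][c] + sum(q[i] * U[i][c] for i in others) for c in range(t)]
    return new_U, new_V, new_x


def form(U, x):
    """f(x) = squared length of x_1 U_1 + ... + x_t U_t, as in (18)."""
    t = len(U)
    vec = [sum(x[i] * U[i][c] for i in range(t)) for c in range(t)]
    return dot(vec, vec)


a, m = 137, 256
U = [[m, 0, 0], [-a, 1, 0], [-a * a, 0, 1]]
V = [[Fraction(1, m), Fraction(a, m), Fraction(a * a, m)], [0, 1, 0], [0, 0, 1]]
x = [3, -4, 5]

U2, V2, x2 = apply_23(U, V, x, j=0, q=[0, 2, -7])
print(form(U, x) == form(U2, x2))                                   # f'(x') = f(x)
print(all(dot(U2[i], V2[k]) == (1 if i == k else 0)
          for i in range(3) for k in range(3)))                     # condition (19)
```

Both checks print True for any integer choice of q, since only exact integer and rational arithmetic is involved.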

Our goal is to use transformation (23), replacing $U_i$ by $U_i'$ and $V_i$ by $V_i'$ for
all i, in order to make the right-hand side of (22) small; and the right-hand side
of (22) will be small when both $U_j \cdot U_j$ and $V_k \cdot V_k$ are small. Therefore it is
natural to ask the following two questions about the transformation (23):

a) What choice of $q_i$ makes $V_i' \cdot V_i'$ as small as possible?

b) What choice of $q_1, \ldots, q_{j-1}, q_{j+1}, \ldots, q_t$ makes $U_j' \cdot U_j'$ as
small as possible?

It is easiest to solve these questions first for real values of the $q_i$. Question (a)
is quite simple, since

$$\begin{aligned}
(V_i - q_iV_j) \cdot (V_i - q_iV_j) &= V_i \cdot V_i - 2q_i\,V_i \cdot V_j + q_i^2\,V_j \cdot V_j \\
&= (V_j \cdot V_j)\bigl(q_i - (V_i \cdot V_j / V_j \cdot V_j)\bigr)^2 + V_i \cdot V_i - (V_i \cdot V_j)^2 / V_j \cdot V_j,
\end{aligned}$$

and the minimum occurs when

$$q_i = V_i \cdot V_j / V_j \cdot V_j.$$ (24)

Geometrically, we are asking what multiple of $V_j$ should be subtracted from $V_i$
so that the resulting vector $V_i'$ has minimum length, and the answer is to choose
$q_i$ so that $V_i'$ is perpendicular to $V_j$ (i.e., so that $V_i' \cdot V_j = 0$); the following
diagram makes this plain.



Turning to question (b), we want to choose the $q_i$ so that $U_j + \sum_{i \ne j} q_i U_i$ has
minimum length; geometrically, we want to start with $U_j$ and add some vector
in the $(t-1)$-dimensional hyperplane whose points are the sums of multiples
of $\{ U_i \mid i \ne j \}$. Again the best solution is to choose things so that $U_j'$ is
perpendicular to the hyperplane, i.e., so that $U_j' \cdot U_k = 0$ for all $k \ne j$, i.e.,

$$U_j \cdot U_k + \sum_{i \ne j} q_i (U_i \cdot U_k) = 0, \qquad 1 \le k \le t, \quad k \ne j. \tag{26}$$

(See exercise 12 for a rigorous proof that a solution to question (b) must satisfy 
these t — 1 equations.) 

Now that we have answered questions (a) and (b), we are in a bit of a
quandary; should we choose the $q_i$ according to (24), so that the $V_i' \cdot V_i'$ are
minimized, or according to (26), so that $U_j' \cdot U_j'$ is minimized? Either of these
alternatives makes an improvement in the right-hand side of (22), so it is not
immediately clear which choice should get priority. Fortunately, there is a very
simple answer to this dilemma: Conditions (24) and (26) are exactly the same!
(See exercise 7.) Therefore questions (a) and (b) have the same answer; we have
a happy state of affairs in which we can reduce the length of both the $U$'s and
the $V$'s simultaneously. (It may be worthwhile to point out that we have just
rediscovered the "Schmidt orthogonalization process.")
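
The coincidence of (24) and (26) can be checked numerically. The following is a minimal sketch, not part of the original text: it borrows the two matrices that step S4 sets up in the worked example later in this section (their rows satisfy $U_i \cdot V_j = m\,\delta_{ij}$, and the scaling by $m$ does not affect the ratio in (24)), and the helper names `dot` and `residuals_26` are hypothetical. Exact rational arithmetic is used so that the residuals of (26) come out as exact zeros.

```python
from fractions import Fraction

def dot(x, y):
    """Ordinary dot product of two integer vectors."""
    return sum(p * q for p, q in zip(x, y))

# Matrices produced by step S4 in the worked example (a = 3141592621, m = 10**10).
U = [[-67654, -226, 0], [-44190611, 191, 0], [5793866, 33, 1]]
V = [[-191, -44190611, 2564918569],
     [-226, 67654, 1307181134],
     [0, 0, 10000000000]]

def residuals_26(U, V, j):
    """Insert the real-valued q_i of (24) into the left side of (26), for each k != j."""
    t = len(U)
    q = {i: Fraction(dot(V[i], V[j]), dot(V[j], V[j])) for i in range(t) if i != j}
    return [dot(U[j], U[k]) + sum(q[i] * dot(U[i], U[k]) for i in q)
            for k in range(t) if k != j]

for j in range(3):
    print(j + 1, residuals_26(U, V, j))   # each residual is exactly zero
```

The check works for any pair of matrices whose rows are dual in the sense of (19) or (29); the particular matrices are only a convenient source of test data.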

Our joy must be tempered with the realization that we have dealt with
questions (a) and (b) only for real values of the $q_i$. Our application restricts us
to integer values, so we cannot make $V_i'$ exactly perpendicular to $V_j$. The best
we can do for question (a) is to let $q_i$ be the nearest integer to $V_i \cdot V_j / V_j \cdot V_j$
(cf. (25)). It turns out that this is not always the best solution to question (b);
in fact $U_j'$ may at times be longer than $U_j$. However, the bound (21) is never
increased, since we can remember the smallest value of $f(y_1, \ldots, y_t)$ found so far.
Thus a choice of $q_i$ based solely on question (a) is quite satisfactory.

If we apply transformation (23) repeatedly in such a way that none of the
vectors $V_i$ gets longer and at least one gets shorter, we can never get into a
loop; i.e., we will never be considering the same quadratic form again after a
sequence of nontrivial transformations of this kind. But eventually we will get
"stuck," in the sense that none of the transformations (23) for $1 \le j \le t$
will be able to shorten any of the vectors $V_1, \ldots, V_t$. At that point we can
revert to an exhaustive search, using the bounds of Lemma A, which will now
be quite small in most cases. Occasionally these bounds (21) will be poor, and
another type of transformation will usually get the algorithm unstuck again and
reduce the bounds (see exercise 18). However, transformation (23) by itself has
proved to be quite adequate for the spectral test; in fact, it has proved to be
amazingly powerful when the computations are arranged as in the algorithm
discussed below.

*D. How to perform the spectral test. Here now is an efficient computational 
procedure that follows from our considerations. R. W. Gosper and U. Dieter 
have observed that it is possible to use the results of lower dimensions to make 
the spectral test significantly faster in higher dimensions. This refinement has 
been incorporated into the following algorithm, together with a significant
simplification of the two-dimensional case.

Algorithm S (The spectral test). This algorithm determines the value of

$$\nu_t = \min\Bigl\{ \sqrt{x_1^2 + \cdots + x_t^2} \;\Bigm|\; x_1 + a x_2 + \cdots + a^{t-1} x_t \equiv 0 \ (\text{modulo } m) \Bigr\} \tag{27}$$

for $2 \le t \le T$, given $a$, $m$, and $T$, where $0 < a < m$ and $a$ is relatively prime
to $m$. (The number $\nu_t$ measures the $t$-dimensional accuracy of random number
generators, as discussed in the text above.) All arithmetic within this algorithm
is done on integers whose magnitudes rarely if ever exceed $m^2$, except in step S8;
in fact, nearly all of the integer variables will be less than $m$ in absolute value
during the computation.

When $\nu_t$ is being calculated for $t \ge 3$, the algorithm works with two $t \times t$
matrices $U$ and $V$, whose row vectors are denoted by $U_i = (u_{i1}, \ldots, u_{it})$ and
$V_i = (v_{i1}, \ldots, v_{it})$ for $1 \le i \le t$. These vectors satisfy the conditions

$$u_{i1} + a u_{i2} + \cdots + a^{t-1} u_{it} \equiv 0 \ (\text{modulo } m), \qquad 1 \le i \le t; \tag{28}$$

$$U_i \cdot V_j = \delta_{ij}\, m, \qquad 1 \le i, j \le t. \tag{29}$$

(Thus the $V_j$ of our previous discussion have been multiplied by $m$, to ensure
that their components are integers.) There are three other auxiliary vectors,
$X = (x_1, \ldots, x_t)$, $Y = (y_1, \ldots, y_t)$, and $Z = (z_1, \ldots, z_t)$. During the entire
algorithm, $r$ will denote $a^{t-1} \bmod m$ and $s$ will denote the smallest upper bound
for $\nu_t^2$ that has been discovered so far.

S1. [Initialize.] Set $h \leftarrow a$, $h' \leftarrow m$, $p \leftarrow 1$, $p' \leftarrow 0$, $r \leftarrow a$, $s \leftarrow 1 + a^2$. (The
first steps of this algorithm handle the case $t = 2$ by a special method,
very much like Euclid's algorithm; we will have

$$h - ap \equiv h' - ap' \equiv 0 \ (\text{modulo } m) \quad \text{and} \quad hp' - h'p = \pm m \tag{30}$$

during this phase of the calculation.)

S2. [Euclidean step.] Set $q \leftarrow \lfloor h'/h \rfloor$, $u \leftarrow h' - qh$, $v \leftarrow p' - qp$. If $u^2 + v^2 < s$,
set $s \leftarrow u^2 + v^2$, $h' \leftarrow h$, $h \leftarrow u$, $p' \leftarrow p$, $p \leftarrow v$, and repeat step S2.

S3. [Compute $\nu_2$.] Set $u \leftarrow u - h$, $v \leftarrow v - p$; and if $u^2 + v^2 < s$, set
$s \leftarrow u^2 + v^2$, $h' \leftarrow u$, $p' \leftarrow v$. Then output $\sqrt{s} = \nu_2$. (The validity of this
calculation for the two-dimensional case is proved in exercise 5. Now we
will set up the $U$ and $V$ matrices satisfying (28) and (29), in preparation
for calculations in higher dimensions.) Set

$$U \leftarrow \begin{pmatrix} -h & p \\ -h' & p' \end{pmatrix}, \qquad V \leftarrow \pm \begin{pmatrix} p' & h' \\ -p & -h \end{pmatrix},$$

where the $-$ sign is chosen for $V$ if and only if $p' > 0$.

S4. [Advance $t$.] If $t = T$, the algorithm terminates. (Otherwise we want
to increase $t$ by 1. At this point $U$ and $V$ are $t \times t$ matrices satisfying
(28) and (29), and we must enlarge them by adding an appropriate new
row and column.) Set $t \leftarrow t + 1$ and $r \leftarrow (ar) \bmod m$. Set $U_t$ to the
new row $(-r, 0, 0, \ldots, 0, 1)$ of $t$ elements, and set $u_{it} \leftarrow 0$ for $1 \le i < t$.
Set $V_t$ to the new row $(0, 0, 0, \ldots, 0, m)$. Finally, for $1 \le i < t$, set $q \leftarrow
\mathrm{round}(v_{i1} r/m)$, $v_{it} \leftarrow v_{i1} r - qm$, and $U_t \leftarrow U_t + qU_i$. (Here "round($x$)"
denotes the nearest integer to $x$, e.g., $\lfloor x + 1/2 \rfloor$. We are essentially setting
$v_{it} \leftarrow v_{i1} r$ and immediately applying transformation (23) with $j = t$, since
the numbers $|v_{i1} r|$ are so large they ought to be reduced at once.) Finally
set $s \leftarrow \min(s, U_t \cdot U_t)$, $k \leftarrow t$, and $j \leftarrow 1$. (In the following steps, $j$ denotes
the current row index for transformation (23), and $k$ denotes the last such
index where the transformation shortened at least one of the $V_i$.)

S5. [Transform.] For $1 \le i \le t$, do the following operations: If $i \ne j$ and
$2\,|V_i \cdot V_j| > V_j \cdot V_j$, set $q \leftarrow \mathrm{round}(V_i \cdot V_j / V_j \cdot V_j)$, $V_i \leftarrow V_i - qV_j$,
$U_j \leftarrow U_j + qU_i$, and $k \leftarrow j$. (The fact that we omit this transformation,
when $2\,|V_i \cdot V_j|$ exactly equals $V_j \cdot V_j$, prevents the algorithm from looping
endlessly; see exercise 19.)

S6. [Examine new bound.] If $k = j$ (i.e., if the transformation in S5 has just
done something useful), set $s \leftarrow \min(s, U_j \cdot U_j)$.

S7. [Advance $j$.] If $j = t$, set $j \leftarrow 1$; otherwise set $j \leftarrow j + 1$. Now if $j \ne k$,
return to step S5. (If $j = k$, we have gone through $t - 1$ consecutive cycles
of no transformation, so the transformation process is stuck.)

S8. [Prepare for search.] (Now the absolute minimum will be determined,
using an exhaustive search over all $(x_1, \ldots, x_t)$ satisfying condition (21)
of Lemma A.) Set $X \leftarrow Y \leftarrow (0, \ldots, 0)$, set $k \leftarrow t$, and set

$$z_j \leftarrow \Bigl\lfloor \sqrt{\lfloor (V_j \cdot V_j)\, s / m^2 \rfloor} \Bigr\rfloor \qquad \text{for } 1 \le j \le t. \tag{31}$$

(We will examine all $X = (x_1, \ldots, x_t)$ with $|x_j| \le z_j$ for $1 \le j \le t$. In
hundreds of applications of this algorithm, no $z_j$ has yet turned out to be
greater than 1, nor has the exhaustive search in the following steps ever
reduced $s$; however, such phenomena are probably possible in weird cases,
especially in higher dimensions. During the exhaustive search, the vector $Y$
will always be equal to $x_1 U_1 + \cdots + x_t U_t$, so that $f(x_1, \ldots, x_t) = Y \cdot Y$.
Since $f(-x_1, \ldots, -x_t) = f(x_1, \ldots, x_t)$, we shall examine only vectors
whose first nonzero component is positive. The method is essentially that
of counting in steps of one, regarding $(x_1, \ldots, x_t)$ as the digits in a balanced
number system with mixed radices $(2z_1 + 1, \ldots, 2z_t + 1)$; cf. Section 4.1.)

S9. [Advance $x_k$.] If $x_k = z_k$, go to S11. Otherwise increase $x_k$ by 1 and set
$Y \leftarrow Y + U_k$.

S10. [Advance $k$.] Set $k \leftarrow k + 1$. Then if $k \le t$, set $x_k \leftarrow -z_k$, $Y \leftarrow Y - 2z_k U_k$,
and repeat step S10. But if $k > t$, set $s \leftarrow \min(s, Y \cdot Y)$.

S11. [Decrease $k$.] Set $k \leftarrow k - 1$. If $k \ge 1$, return to S9. Otherwise output
$\nu_t = \sqrt{s}$ (the exhaustive search is completed) and return to S4. ∎
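
For readers who prefer code to step lists, here is a sketch of steps S1 through S11 in Python. It is an illustration under stated assumptions, not the author's program: the function names `spectral_test`, `nearest`, and `dot` are hypothetical, rows are indexed from 0 rather than 1, and Python's unbounded integers stand in for the multiple-precision arithmetic the text mentions. The function returns the squares $\nu_2^2, \ldots, \nu_T^2$ instead of printing square roots.

```python
from math import isqrt

def dot(x, y):
    return sum(p * q for p, q in zip(x, y))

def nearest(num, den):
    """round(num/den) as floor(num/den + 1/2), computed exactly; requires den > 0."""
    return (2 * num + den) // (2 * den)

def spectral_test(a, m, T):
    """Return [nu_2**2, ..., nu_T**2] for multiplier a and modulus m (gcd(a, m) = 1)."""
    # S1. Initialize.
    h, h1, p, p1, r, s = a, m, 1, 0, a, 1 + a * a
    # S2. Euclidean step, repeated for as long as it lowers s.
    while True:
        q = h1 // h
        u, v = h1 - q * h, p1 - q * p
        if u * u + v * v >= s:
            break
        s, h1, h, p1, p = u * u + v * v, h, u, p, v
    # S3. Compute nu_2, then set up U and V.
    u, v = u - h, v - p
    if u * u + v * v < s:
        s, h1, p1 = u * u + v * v, u, v
    result = [s]
    U = [[-h, p], [-h1, p1]]
    sg = -1 if p1 > 0 else 1
    V = [[sg * p1, sg * h1], [-sg * p, -sg * h]]
    for t in range(3, T + 1):
        # S4. Advance t: add a column of zeros, then a new last row.
        r = (a * r) % m
        for row in U + V:
            row.append(0)
        U.append([-r] + [0] * (t - 2) + [1])
        V.append([0] * (t - 1) + [m])
        for i in range(t - 1):
            q = nearest(V[i][0] * r, m)
            V[i][-1] = V[i][0] * r - q * m
            U[-1] = [x + q * y for x, y in zip(U[-1], U[i])]
        s = min(s, dot(U[-1], U[-1]))
        k, j = t - 1, 0
        while True:
            # S5. Transform with respect to row j.
            vjj = dot(V[j], V[j])
            for i in range(t):
                vij = dot(V[i], V[j])
                if i != j and 2 * abs(vij) > vjj:
                    q = nearest(vij, vjj)
                    V[i] = [x - q * y for x, y in zip(V[i], V[j])]
                    U[j] = [x + q * y for x, y in zip(U[j], U[i])]
                    k = j
            # S6. Examine new bound.
            if k == j:
                s = min(s, dot(U[j], U[j]))
            # S7. Advance j; stop when a full round made no change.
            j = (j + 1) % t
            if j == k:
                break
        # S8. Prepare for search.
        z = [isqrt(dot(Vj, Vj) * s // (m * m)) for Vj in V]
        X, Y, k = [0] * t, [0] * t, t - 1
        while k >= 0:
            if X[k] == z[k]:                  # S9: digit exhausted, go to S11
                k -= 1
                continue
            X[k] += 1                         # S9: advance x_k
            Y = [y + x for y, x in zip(Y, U[k])]
            for k2 in range(k + 1, t):        # S10: reset the digits to the right
                X[k2] = -z[k2]
                Y = [y - 2 * z[k2] * x for y, x in zip(Y, U[k2])]
            s = min(s, dot(Y, Y))
            k = t - 1                         # S11: resume from the last digit
        result.append(s)
    return result

print(spectral_test(137, 256, 6))
print(spectral_test(3141592621, 10**10, 3))
```

The two calls correspond to lines 5 and 6 of Table 1, so their output can be compared with the $\nu_t^2$ columns there.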

In practice Algorithm S is applied for $T = 5$ or 6, say; it usually works reasonably
well when $T = 7$ or 8, but it can be terribly slow when $T \ge 9$ since the exhaustive
search tends to make the running time grow as $3^T$. (If the minimum value $\nu_t$
occurs at many different points, the exhaustive search will hit them all; hence
we typically find that all $z_k = 1$ for large $t$. As remarked above, the values of $\nu_t$
are generally irrelevant for practical purposes when $t$ is large.)

An example will help to make Algorithm S clear. Consider the linear
congruential sequence defined by

$$m = 10^{10}, \qquad a = 3141592621, \qquad c = 1, \qquad X_0 = 0. \tag{32}$$

Six cycles of the Euclidean algorithm in steps S2 and S3 suffice to prove that the
minimum nonzero value of $x_1^2 + x_2^2$ with

$$x_1 + 3141592621\, x_2 \equiv 0 \ (\text{modulo } 10^{10})$$

occurs for $x_1 = 67654$, $x_2 = 226$; hence the two-dimensional accuracy of this
generator is

$$\nu_2 = \sqrt{67654^2 + 226^2} \approx 67654.37748.$$

Passing to three dimensions, we seek the minimum nonzero value of $x_1^2 + x_2^2 + x_3^2$
such that

$$x_1 + 3141592621\, x_2 + 3141592621^2\, x_3 \equiv 0 \ (\text{modulo } 10^{10}). \tag{33}$$

Step S4 sets up the matrices

$$U = \begin{pmatrix} -67654 & -226 & 0 \\ -44190611 & 191 & 0 \\ 5793866 & 33 & 1 \end{pmatrix}, \qquad
V = \begin{pmatrix} -191 & -44190611 & 2564918569 \\ -226 & 67654 & 1307181134 \\ 0 & 0 & 10000000000 \end{pmatrix}.$$

The first iteration of step S5, with $q = 1$ for $i = 2$ and $q = 4$ for $i = 3$, changes
them to

$$U = \begin{pmatrix} -21082801 & 97 & 4 \\ -44190611 & 191 & 0 \\ 5793866 & 33 & 1 \end{pmatrix}, \qquad
V = \begin{pmatrix} -191 & -44190611 & 2564918569 \\ -35 & 44258265 & -1257737435 \\ 764 & 176762444 & -259674276 \end{pmatrix}.$$

(Note that the first row $U_1$ has actually gotten longer in this transformation,
although eventually the rows of $U$ should get shorter.)
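
The single pass of step S5 just described can be replayed in a few lines. This is only a sketch: the function name `transform` and the helpers are hypothetical, and row indices are 0-based, so the call `transform(U, V, 0)` corresponds to $j = 1$ in the text.

```python
def dot(x, y):
    return sum(p * q for p, q in zip(x, y))

def nearest(num, den):
    """Nearest integer to num/den, as floor(num/den + 1/2); requires den > 0."""
    return (2 * num + den) // (2 * den)

def transform(U, V, j):
    """One pass of step S5 for row j: V_i <- V_i - q V_j and U_j <- U_j + q U_i."""
    vjj = dot(V[j], V[j])
    for i in range(len(V)):
        vij = dot(V[i], V[j])
        if i != j and 2 * abs(vij) > vjj:
            q = nearest(vij, vjj)
            V[i] = [x - q * y for x, y in zip(V[i], V[j])]
            U[j] = [x + q * y for x, y in zip(U[j], U[i])]

U = [[-67654, -226, 0], [-44190611, 191, 0], [5793866, 33, 1]]
V = [[-191, -44190611, 2564918569],
     [-226, 67654, 1307181134],
     [0, 0, 10000000000]]

transform(U, V, 0)
print(U[0])        # [-21082801, 97, 4]
print(V[1], V[2])  # the second and third rows of the new V
```

Note that the dot products $V_i \cdot V_j$ reach about $10^{19}$ here, which is why exact integer arithmetic is used for the rounding instead of floating point.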

The next fourteen iterations of step S5 have $(j, q_1, q_2, q_3) = (2, -2, *, 0)$,
$(3, 0, 3, *)$, $(1, *, -10, -1)$, $(2, -1, *, -6)$, $(3, -1, 0, *)$, $(1, *, 0, 2)$, $(2, 0, *, -1)$,
$(3, 3, 4, *)$, $(1, *, 0, 0)$, $(2, -5, *, 0)$, $(3, 1, 0, *)$, $(1, *, -3, -1)$, $(2, 0, *, 0)$, $(3, 0, 0, *)$.
Now the transformation process is stuck, but the rows of the matrices have
become significantly shorter:

$$U = \begin{pmatrix} -1479 & 616 & -2777 \\ -3022 & 104 & 918 \\ -227 & -983 & -130 \end{pmatrix}, \qquad
V = \begin{pmatrix} -888874 & 601246 & -2994234 \\ -2809871 & 438109 & 1593689 \\ -854296 & -9749816 & -1707736 \end{pmatrix}. \tag{34}$$

The search limits $(z_1, z_2, z_3)$ in step S8 turn out to be $(0, 0, 1)$, so $U_3$ is the
shortest solution to (33); we have

$$\nu_3 = \sqrt{227^2 + 983^2 + 130^2} \approx 1017.21089.$$


Note that only a few iterations were needed to find this value, although condition
(33) looks quite difficult to deal with at first glance. All points $(U_n, U_{n+1}, U_{n+2})$
produced by this random number generator lie on a family of parallel planes
about 0.001 units apart.
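
The two claims in this paragraph are easy to confirm directly. The snippet below is a small check, not part of the original: it verifies that the third row of $U$ in (34) satisfies congruence (33), and it evaluates $1/\nu_3$, the spacing between the planes when the points are scaled to the unit cube. The variable names are chosen only for this illustration.

```python
from math import sqrt

a, m = 3141592621, 10**10
x = (-227, -983, -130)                     # row U_3 of (34)

lhs = x[0] + a * x[1] + a * a * x[2]       # left side of congruence (33)
nu3_squared = sum(c * c for c in x)
spacing = 1 / sqrt(nu3_squared)            # distance between adjacent planes

print(lhs % m)                 # 0, so (33) holds
print(nu3_squared, spacing)    # 1034718 and roughly 0.00098
```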

E. Ratings for various generators. So far we haven't really given a criterion
that tells us whether or not a particular random number generator "passes" or
"flunks" the spectral test. In fact, this depends on the application, since some
applications demand higher resolution than others. It appears that $\nu_t \ge 2^{30/t}$
for $2 \le t \le 6$ will be quite adequate in most applications (although the author
must admit choosing this criterion partly because 30 is conveniently divisible by
2, 3, 5, and 6).

For some purposes we would like a criterion that is relatively independent
of $m$, so we can say that a particular multiplier is good or bad with respect to
the set of all other multipliers for the given $m$, without examining any others. A
reasonable figure of merit for rating the goodness of a particular multiplier seems
to be the volume of the ellipsoid in $t$-space defined by the relation
$(x_1 m - x_2 a - \cdots - x_t a^{t-1})^2 + x_2^2 + \cdots + x_t^2 \le \nu_t^2$, since this volume tends to indicate how
likely it is that nonzero integer points $(x_1, \ldots, x_t)$, corresponding to solutions of
(15), are in the ellipsoid. We therefore propose to calculate this volume, namely

$$\mu_t = \frac{\pi^{t/2}\, \nu_t^t}{(t/2)!\; m}, \tag{35}$$

as an indication of the effectiveness of the multiplier $a$ for the given $m$. In this
formula,

$$\Bigl(\frac{t}{2}\Bigr)! = \Bigl(\frac{t}{2}\Bigr)\Bigl(\frac{t}{2} - 1\Bigr) \cdots \Bigl(\frac{1}{2}\Bigr)\sqrt{\pi}, \qquad \text{for } t \text{ odd}. \tag{36}$$
Table 1
SAMPLE RESULTS OF THE SPECTRAL TEST

Line  a              m            ν_2^2         ν_3^2        ν_4^2       ν_5^2      ν_6^2
  1   23             10^8+1       530           530          530         530        447
  2   2^7+1          2^35         16642         16642        16642       15602      252
  3   2^18+1         2^35         34359738368   6            4           4          4
  4   3141592653     2^35         2997222016    1026050      27822       1118       1118
  5   137            256          274           30           14          6          4
  6   3141592621     10^10        4577114792    1034718      62454       1776       542
  7   3141592221     10^10        4293881050    276266       97450       3366       2382
  8   4219755981     10^10        10721093248   2595578      49362       5868       820
  9   4160984121     10^10        9183801602    4615650      16686       6840       1344
 10   3141592221     2^35         13539813818   5795090      88134       12716      2938
 11   2718281829     2^35         22939188896   2723830      146116      10782      2914
 12   5^13           2^35         33161885770   2925242      113374      13070      2256
 13   5^15           2^35         22078865098   10274746     167558      5844       2592
 14   2^23+2^12+5    2^35         167510120     8052254      21476       16802      1630
 15   2^23+2^13+5    2^35         168231328     5335322      21476       2008       1134
 16   2^23+2^14+5    2^35         12256151168   5733878      21476       13316      2032
 17   2^22+2^13+5    2^35         8201443840    1830230      21476       7786       3080
 18   2^24+2^13+5    2^35         8364058       8364058      21476       16712      1496
 19   19935388837    2^35         32300850938   705518       22270       9558       2660
 20   1175245817     2^35         36436418002   7362242      95306       3006       2860
 21   17059465       2^35         39341117000   9476606      202796      18758      2382
 22   2^16+3         2^29         536805386     118          116         116        116
 23   1812433253     2^32         4326934538    1462856      15082       4866       906
 24   1566083941     2^32         4659748970    2079590      44902       4652       662
 25   69069          2^32         4243209856    2072544      52804       6990       242
 26   1664525        2^32         4938916874    2322494      63712       4092       1038
 27   314159269      2^31-1       1432232969    899290       36985       3427       1144
 28   see (39)       (2^31-1)^2   4.6 x 10^18   1.4 x 10^12  643578623   12930027   837632
 29   31167285       2^48         3.2 x 10^14   4111841446   17341510    306326     59278
 30   see the text   2^64         8.8 x 10^18   6.4 x 10^12  4.1 x 10^9  45662836   1846368

Thus, in six or fewer dimensions the merit is computed as follows:

$$\mu_2 = \pi \nu_2^2 / m, \qquad \mu_3 = \tfrac{4}{3} \pi \nu_3^3 / m, \qquad \mu_4 = \tfrac{1}{2} \pi^2 \nu_4^4 / m,$$
$$\mu_5 = \tfrac{8}{15} \pi^2 \nu_5^5 / m, \qquad \mu_6 = \tfrac{1}{6} \pi^3 \nu_6^6 / m. \tag{37}$$

We might say that the multiplier $a$ passes the spectral test if $\mu_t$ is 0.1 or more
for $2 \le t \le 6$, and it "passes with flying colors" if $\mu_t \ge 1$ for all these $t$. A low
value of $\mu_t$ means that we have probably picked a very unfortunate multiplier,
since very few lattices will have integer points so close to the origin. Conversely,
a high value of $\mu_t$ means that we have found an unusually good multiplier for
the given $m$; but it does not mean that the random numbers are necessarily very
good, since $m$ might be too small. Only the values $\nu_t$ truly indicate the degree
of randomness.
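
As an illustration of formula (35), the following sketch evaluates $\mu_t$ for the generator of the worked example, using the $\nu_t^2$ values listed for it in line 6 of Table 1. The function name `merit` is hypothetical; the only identity relied on is the standard one $(t/2)! = \Gamma(t/2 + 1)$, which agrees with (36) for odd $t$.

```python
from math import pi, gamma

def merit(t, nu_squared, m):
    """Figure of merit (35): pi^(t/2) * nu_t^t / ((t/2)! * m)."""
    return pi ** (t / 2) * nu_squared ** (t / 2) / (gamma(t / 2 + 1) * m)

# nu_t^2 for a = 3141592621, m = 10**10 (line 6 of Table 1).
nu_squared = {2: 4577114792, 3: 1034718, 4: 62454, 5: 1776, 6: 542}

for t, v in nu_squared.items():
    mu = merit(t, v, 10**10)
    verdict = "passes" if mu >= 0.1 else "flunks"
    print(t, round(mu, 2), verdict)
```

With the 0.1 threshold suggested above, this generator passes in dimensions 2, 3, and 4 but falls just short in dimensions 5 and 6.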
Table 1 (continued)

lg ν_2  lg ν_3  lg ν_4  lg ν_5  lg ν_6     μ_2    μ_3    μ_4    μ_5    μ_6   Line
  4.5     4.5     4.5     4.5     4.4     2e-5   5e-4   0.01   0.34   4.62     1
  7.0     7.0     7.0     7.0     4.0     2e-6   3e-4   0.04   4.66   2e-3     2
 17.5     1.3     1.0     1.0     1.0     3.14   2e-9   2e-9   5e-9   1e-8     3
 15.7    10.0     7.4     5.0     5.0     0.27   0.13   0.11   0.01   0.21     4
  4.0     2.5     1.9     1.3     1.0     3.36   2.69   3.78   1.81   1.29     5
 16.0    10.0     8.0     5.4     4.5     1.44   0.44   1.92   0.07   0.08     6
 16.0     9.0     8.3     5.9     5.6     1.35   0.06   4.69   0.35   6.98     7
 16.7    10.7     7.8     6.3     4.8     3.37   1.75   1.20   1.39   0.28     8
 16.5    11.1     7.0     6.4     5.2     2.89   4.15   0.14   2.04   1.25     9
 16.8    11.2     8.2     6.8     5.8     1.24   1.70   1.12   2.79   3.81    10
 17.2    10.7     8.6     6.7     5.8     2.10   0.55   3.15   1.85   3.72    11
 17.5    10.7     8.4     6.8     5.6     3.03   0.61   1.85   2.99   1.73    12
 17.2    11.6     8.7     6.3     5.7     2.02   4.02   4.03   0.40   2.62    13
 13.7    11.5     7.2     7.0     5.3     0.02   2.79   0.07   5.61   0.65    14
 13.7    11.2     7.2     5.5     5.1     0.02   1.50   0.07   0.03   0.22    15
 16.8    11.2     7.2     6.9     5.5     1.12   1.67   0.07   3.13   1.26    16
 16.5    10.4     7.2     6.5     5.8     0.75   0.30   0.07   0.82   4.39    17
 11.5    11.5     7.2     7.0     5.3     8e-4   2.95   0.07   5.53   0.50    18
 17.5     9.7     7.2     6.6     5.7     2.95   0.07   0.07   1.37   2.83    19
 17.5    11.4     8.3     5.8     5.7     3.33   2.44   1.30   0.08   3.52    20
 17.6    11.6     8.8     7.1     5.6     3.60   3.56   5.91   7.38   2.03    21
 14.5     3.4     3.4     3.4     3.4     3.14   1e-5   1e-4   1e-3   0.02    22
 16.0    10.2     6.9     6.1     4.9     3.16   1.73   0.26   2.02   0.89    23
 16.1    10.5     7.7     6.1     4.7     3.41   2.92   2.32   1.81   0.35    24
 16.0    10.5     7.8     6.4     4.0     3.10   2.91   3.20   5.01   0.02    25
 16.1    10.6     8.0     6.0     5.0     3.61   3.45   4.66   1.31   1.35    26
 15.2     9.9     7.6     5.9     5.1     2.10   1.66   3.14   1.69   3.60    27
 31.0    20.2    14.6    11.8     9.8     3.14   1.49   0.44   0.69   0.66    28
 24.1    16.0    12.0     9.1     7.9     3.60   3.92   5.27   0.97   3.82    29
 31.5    21.3    16.0    12.7    10.4     1.50   3.68   4.52   4.02   1.76    30

            upper bounds from (40):       3.63   5.92   9.87  14.89  23.87

Table 1 shows what sorts of values occur in typical sequences. Each line
of the table considers a particular generator, and lists $\nu_t^2$, $\mu_t$, and the "number
of bits of accuracy" $\lg \nu_t$. Lines 1 through 4 show the generators that were the
subject of Figs. 2 and 5 in Section 3.3.1. The generators in lines 1 and 2 suffer
from too small a multiplier; a diagram like Fig. 8 will have nearly vertical
"stripes" when $a$ is small. The terrible generator in line 3 has a good $\mu_2$ but
very poor $\mu_3$ and $\mu_4$; like nearly all generators of potency 2, it has $\nu_3 = \sqrt{6}$
and $\nu_4 = 2$ (see exercise 3). Line 4 shows a "random" multiplier; this generator
has satisfactorily passed numerous empirical tests for randomness, but it does
not have especially high values of $\mu_2, \ldots, \mu_6$. In fact, the value of $\mu_5$ flunks our
criterion.

Line 5 shows the generator of Fig. 8. It passes the spectral test with very
high-flying colors, when $\mu_2$ through $\mu_6$ are considered, but of course $m$ is so
small that the numbers can hardly be called random; the $\nu_t$ values are terribly
low.

Line 6 is the generator discussed above; line 7 is a similar example, having
an abnormally low value of $\mu_3$. Line 8 shows a nonrandom multiplier for the
same modulus $m$; all of its partial quotients are 1, 2, or 3. Such multipliers have
been suggested by I. Borosh and H. Niederreiter because the Dedekind sums are
likely to be especially small and because they produce best results in the
two-dimensional serial test (cf. Section 3.3.3 and exercise 30). The particular example
in line 8 has only one '3' as a partial quotient; there is no multiplier congruent to 1
modulo 20 whose partial quotients with respect to $10^{10}$ are only 1's and 2's. The
generator in line 9 shows another multiplier chosen with malice aforethought,
following a suggestion by A. G. Waterman that guarantees a reasonably high
value of $\mu_2$ (see exercise 11).

Lines 10 through 21 of Table 1 show further examples with $m = 2^{35}$,
beginning with some random multipliers. The generators in lines 12 and 13
are reminders of the good old days; they were once used extensively since O.
Taussky first suggested them in the early 1950s. Lines 14 through 18 show
various multipliers of maximum potency having only four 1's in their binary
representation. The point of having few 1's is to replace multiplication by a few
additions, but only line 16 comes near to being passable. Since these multipliers
satisfy $(a - 5)^3 \bmod 2^{35} = 0$, all five of them achieve $\nu_4$ at the same point
$(x_1, x_2, x_3, x_4) = (-125, 75, -15, 1)$. Another curiosity is the high value of $\mu_3$
following a very low $\mu_2$ in line 18 (see exercise 8). Lines 19 and 20 are respectively
the Borosh-Niederreiter and Waterman multipliers for modulus $2^{35}$; and line 21
was found by M. Lavaux and F. Janssens, in a computer search for spectrally
good multipliers having a very high $\mu_2$.

Lines 22 through 28 apply to System/370 and other machines with 32-bit
words; in this case the comparatively small word size calls for comparatively
greater care. Line 22 is, regrettably, the generator that has actually been used
on such machines in most of the world's scientific computing centers for about
a decade; its very name RANDU is enough to bring dismay into the eyes and
stomachs of many computer scientists! The actual generator is defined by

$$X_0 \text{ odd}, \qquad X_{n+1} = (65539 X_n) \bmod 2^{31}, \tag{38}$$

and exercise 20 indicates that $2^{29}$ is the appropriate modulus for the spectral test.
Since $9X_n - 6X_{n+1} + X_{n+2} \equiv 0$ (modulo $2^{31}$), the generator fails most
three-dimensional criteria for randomness, and it should never have been used. Almost
any multiplier $\equiv 5$ (modulo 8) would be better. (A curious fact about RANDU,
noticed by R. W. Gosper, is that $\nu_4 = \nu_5 = \nu_6 = \nu_7 = \nu_8 = \nu_9 = \sqrt{116}$,
hence $\mu_9$ is a spectacular 11.98.) Lines 23 and 24 are the Borosh-Niederreiter and
Waterman multipliers for modulus $2^{32}$, lines 26 and 29 were found by Lavaux and
Janssens, and line 30 (whose excellent multiplier 6364136223846793005 is too big
to fit in the column) is due to C. E. Haynes. Line 25 was nominated by George
Marsaglia as "a candidate for the best of all multipliers," after a computer search
in dimensions 2 through 5, partly because it is easy to remember. Line 27 uses
a random primitive root, modulo the prime $2^{31} - 1$, as multiplier. Line 28 is for
the sequence

$$X_n = (271828183\, X_{n-1} - 314159269\, X_{n-2}) \bmod (2^{31} - 1), \tag{39}$$

which can be shown to have period length $(2^{31} - 1)^2 - 1$; it has been analyzed
with the generalized spectral test of exercise 24.
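
The three-term relation that dooms RANDU follows from $65539 = 2^{16} + 3$, since $(a - 3)^2 = 2^{32} \equiv 0$ (modulo $2^{31}$) gives $a^2 - 6a + 9 \equiv 0$. A minimal check in Python, with the function name `randu` and the choice of 1000 terms being arbitrary choices made for this sketch:

```python
def randu(x0, n):
    """First n terms of (38): X_{k+1} = 65539 * X_k mod 2**31, starting from odd x0."""
    xs = [x0]
    for _ in range(n - 1):
        xs.append(65539 * xs[-1] % 2**31)
    return xs

xs = randu(1, 1000)
relation_holds = all((9 * xs[n] - 6 * xs[n + 1] + xs[n + 2]) % 2**31 == 0
                     for n in range(len(xs) - 2))
print(relation_holds)   # True
```

Every consecutive triple therefore lies on one of a small number of planes, which is what the tiny value $\nu_3^2 = 118$ in line 22 of Table 1 reflects.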

Theoretical upper bounds on $\mu_t$, which can never be transcended for any $m$,
are shown just below Table 1; it is known that every lattice with $m$ points per
unit volume has

$$\nu_t \le \gamma_t\, m^{1/t}, \tag{40}$$

where $\gamma_t$ takes the respective values

$$(4/3)^{1/4}, \quad 2^{1/6}, \quad 2^{1/4}, \quad 2^{3/10}, \quad (64/3)^{1/12}, \quad 2^{3/7}, \quad 2^{1/2} \tag{41}$$

for $t = 2, \ldots, 8$. (See exercise 9 and J. W. S. Cassels, Introduction to the
Geometry of Numbers (Berlin: Springer, 1959), p. 332.) These bounds hold for
lattices generated by vectors with arbitrary real coordinates. For example, the
optimum lattice for $t = 2$ is hexagonal, and it is generated by vectors of length
$\sqrt{2/(\sqrt{3}\, m)}$ that form two sides of an equilateral triangle. In three dimensions the
optimum lattice is generated by vectors $V_1$, $V_2$, $V_3$ that can be rotated into the
form $(v, v, -v)$, $(v, -v, v)$, $(-v, v, v)$, where $v = 1/\sqrt[3]{4m}$.
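
The row of upper bounds printed beneath Table 1 can be reproduced from (40) and (41): substituting $\nu_t = \gamma_t m^{1/t}$ into (35) makes $m$ cancel, leaving $\pi^{t/2} \gamma_t^t / (t/2)!$. The sketch below does that arithmetic; the names `GAMMA_T` and `merit_bound` are hypothetical.

```python
from math import pi, gamma

# The constants gamma_t of (41), for t = 2, ..., 8.
GAMMA_T = {2: (4 / 3) ** (1 / 4), 3: 2 ** (1 / 6), 4: 2 ** (1 / 4), 5: 2 ** (3 / 10),
           6: (64 / 3) ** (1 / 12), 7: 2 ** (3 / 7), 8: 2 ** (1 / 2)}

def merit_bound(t):
    """Largest possible mu_t: formula (35) with nu_t replaced by gamma_t * m^(1/t)."""
    return pi ** (t / 2) * GAMMA_T[t] ** t / gamma(t / 2 + 1)

for t in range(2, 7):
    print(t, round(merit_bound(t), 2))
```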

*F. Relation to the serial test. In a series of important papers published during
the 1970s, Harald Niederreiter has shown how to analyze the distribution of
the $t$-dimensional vectors (1) by means of exponential sums. One of the main
consequences of his theory is that the serial test in several dimensions will be
passed by any generator that passes the spectral test, even when we consider
only a sufficiently large part of the period instead of the whole period. We shall
now turn briefly to a study of his interesting methods, in the case of linear
congruential sequences $(X_0, a, c, m)$ of period length $m$.

The first idea we need is the notion of discrepancy in $t$ dimensions, a quantity
that we shall define as the difference between the expected number and the actual
number of $t$-dimensional vectors $(x_n, x_{n+1}, \ldots, x_{n+t-1})$ falling into a
hyperrectangular region, maximized over all such regions. To be precise, let $\langle x_n \rangle$ be a
sequence of integers in the range $0 \le x_n < m$. We define

$$D_N^{(t)} = \max_R \left| \frac{\text{number of } (x_n, \ldots, x_{n+t-1}) \text{ in } R \text{ for } 0 \le n < N}{N} - \frac{\text{volume of } R}{m^t} \right| \tag{42}$$

where $R$ ranges over all sets of points of the form

$$R = \{ (y_1, \ldots, y_t) \mid \alpha_1 \le y_1 < \beta_1, \ \ldots, \ \alpha_t \le y_t < \beta_t \}; \tag{43}$$

here $\alpha_j$ and $\beta_j$ are integers in the range $0 \le \alpha_j < \beta_j \le m$, for $1 \le j \le t$. The
volume of $R$ is clearly $(\beta_1 - \alpha_1) \ldots (\beta_t - \alpha_t)$. To get the discrepancy $D_N^{(t)}$, we
imagine looking at all these sets $R$ and finding the one with the greatest excess
or deficiency of points $(x_n, \ldots, x_{n+t-1})$.
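
Definition (42) can be evaluated by brute force when $m$ is tiny, which makes the idea concrete. The following sketch enumerates every box of the form (43); the function names `lcg` and `discrepancy` and the toy generator $X_{n+1} = (5X_n + 1) \bmod 8$ are assumptions made for this illustration only, and the running time grows like $m^{2t}$, so it is of no use for realistic moduli.

```python
from fractions import Fraction
from itertools import product

def lcg(x0, a, c, m, n):
    """First n terms of the linear congruential sequence (x0, a, c, m)."""
    xs, x = [], x0
    for _ in range(n):
        xs.append(x)
        x = (a * x + c) % m
    return xs

def discrepancy(xs, m, t, N):
    """D_N^(t) of (42), maximizing over all boxes R of the form (43). Needs len(xs) >= N+t-1."""
    points = [tuple(xs[n:n + t]) for n in range(N)]
    sides = [(al, be) for al in range(m) for be in range(al + 1, m + 1)]
    best = Fraction(0)
    for box in product(sides, repeat=t):
        count = sum(all(al <= y < be for y, (al, be) in zip(pt, box)) for pt in points)
        volume = 1
        for al, be in box:
            volume *= be - al
        best = max(best, abs(Fraction(count, N) - Fraction(volume, m ** t)))
    return best

xs = lcg(0, 5, 1, 8, 8 + 1)          # one full period plus one extra term for t = 2
print(discrepancy(xs, 8, 1, 8))      # 0: every residue occurs exactly once per period
print(discrepancy(xs, 8, 2, 8))      # positive: 8 points cannot fill 64 cells evenly
```

In one dimension a full-period sequence has discrepancy zero, so the interesting behavior starts at $t = 2$.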
An upper bound for the discrepancy can be found by using exponential sums.
Let $\omega = e^{2\pi i/m}$ be a primitive $m$th root of unity. If $(x_1, \ldots, x_t)$ and $(y_1, \ldots, y_t)$
are two vectors with all components in the range $0 \le x_j, y_j < m$, we have

$$\sum_{0 \le u_1, \ldots, u_t < m} \omega^{(x_1 - y_1)u_1 + \cdots + (x_t - y_t)u_t} =
\begin{cases} m^t, & \text{if } (x_1, \ldots, x_t) = (y_1, \ldots, y_t); \\ 0, & \text{if } (x_1, \ldots, x_t) \ne (y_1, \ldots, y_t). \end{cases}$$

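Therefore the number of vectors $(x_n, \ldots, x_{n+t-1})$ in $R$ for $0 \le n < N$, when
$R$ is defined by (43), can be expressed as

$$\frac{1}{m^t} \sum_{0 \le n < N} \ \sum_{0 \le u_1, \ldots, u_t < m} \omega^{x_n u_1 + \cdots + x_{n+t-1} u_t}
\times \sum_{\alpha_1 \le y_1 < \beta_1} \cdots \sum_{\alpha_t \le y_t < \beta_t} \omega^{-(y_1 u_1 + \cdots + y_t u_t)}.$$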


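When $u_1 = \cdots = u_t = 0$ in this sum, we get $N/m^t$ times the volume of $R$;
hence we can express $D_N^{(t)}$ as

$$\max_R \left| \frac{1}{N m^t} \sum_{0 \le n < N} \ \sum_{\substack{0 \le u_1, \ldots, u_t < m \\ (u_1, \ldots, u_t) \ne (0, \ldots, 0)}} \omega^{x_n u_1 + \cdots + x_{n+t-1} u_t}
\times \sum_{\alpha_1 \le y_1 < \beta_1} \cdots \sum_{\alpha_t \le y_t < \beta_t} \omega^{-(y_1 u_1 + \cdots + y_t u_t)} \right|.$$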

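Since complex numbers satisfy $|w + z| \le |w| + |z|$ and $|wz| = |w|\,|z|$, it follows
that

$$D_N^{(t)} \le \max_R \sum_{\substack{0 \le u_1, \ldots, u_t < m \\ (u_1, \ldots, u_t) \ne (0, \ldots, 0)}}
\left| \frac{1}{m^t} \sum_{\alpha_1 \le y_1 < \beta_1} \cdots \sum_{\alpha_t \le y_t < \beta_t} \omega^{-(y_1 u_1 + \cdots + y_t u_t)} \right| g(u_1, \ldots, u_t)$$
$$\le \sum_{\substack{0 \le u_1, \ldots, u_t < m \\ (u_1, \ldots, u_t) \ne (0, \ldots, 0)}}
\max_R \left| \frac{1}{m^t} \sum_{\alpha_1 \le y_1 < \beta_1} \cdots \sum_{\alpha_t \le y_t < \beta_t} \omega^{-(y_1 u_1 + \cdots + y_t u_t)} \right| g(u_1, \ldots, u_t)$$
$$= \sum_{\substack{0 \le u_1, \ldots, u_t < m \\ (u_1, \ldots, u_t) \ne (0, \ldots, 0)}} f(u_1, \ldots, u_t)\, g(u_1, \ldots, u_t), \tag{44}$$

where

$$g(u_1, \ldots, u_t) = \left| \frac{1}{N} \sum_{0 \le n < N} \omega^{x_n u_1 + \cdots + x_{n+t-1} u_t} \right|;$$

$$f(u_1, \ldots, u_t) = \max_R \left| \frac{1}{m^t} \sum_{\alpha_1 \le y_1 < \beta_1} \cdots \sum_{\alpha_t \le y_t < \beta_t} \omega^{-(y_1 u_1 + \cdots + y_t u_t)} \right|$$
$$= \max_R \left| \frac{1}{m} \sum_{\alpha_1 \le y_1 < \beta_1} \omega^{-u_1 y_1} \right| \cdots \left| \frac{1}{m} \sum_{\alpha_t \le y_t < \beta_t} \omega^{-u_t y_t} \right|.$$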
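Both $f$ and $g$ can be further simplified in order to get a good upper bound on
$D_N^{(t)}$. We have

$$\left| \frac{1}{m} \sum_{\alpha \le y < \beta} \omega^{-uy} \right|
= \left| \frac{1}{m}\, \frac{\omega^{-\beta u} - \omega^{-\alpha u}}{\omega^{-u} - 1} \right|
\le \frac{2}{m\, |\omega^u - 1|} = \frac{1}{m \sin(\pi u/m)}$$

when $u \ne 0$, and the sum is $\le 1$ when $u = 0$; hence

$$f(u_1, \ldots, u_t) \le r(u_1, \ldots, u_t), \tag{45}$$

where

$$r(u_1, \ldots, u_t) = \prod_{\substack{1 \le k \le t \\ u_k \ne 0}} \frac{1}{m \sin(\pi u_k/m)}. \tag{46}$$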

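Furthermore, when $\langle x_n \rangle$ is generated modulo $m$ by a linear congruential sequence,
we have

$$x_n u_1 + \cdots + x_{n+t-1} u_t \equiv x_n u_1 + (a x_n + c) u_2 + \cdots + \bigl(a^{t-1} x_n + c(a^{t-2} + \cdots + 1)\bigr) u_t$$
$$= (u_1 + a u_2 + \cdots + a^{t-1} u_t)\, x_n + h(u_1, \ldots, u_t) \qquad (\text{modulo } m),$$

where $h(u_1, \ldots, u_t)$ is independent of $n$; hence

$$g(u_1, \ldots, u_t) = \left| \frac{1}{N} \sum_{0 \le n < N} \omega^{q(u_1, \ldots, u_t)\, x_n} \right|, \tag{47}$$

where

$$q(u_1, \ldots, u_t) = u_1 + a u_2 + \cdots + a^{t-1} u_t. \tag{48}$$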


Now here is where the connection to the spectral test comes in: We will show
that the sum $g(u_1, \ldots, u_t)$ is rather small unless $q(u_1, \ldots, u_t) \equiv 0$ (modulo $m$),
i.e., unless $(u_1, \ldots, u_t)$ is a solution to (15). Furthermore exercise 27 shows that
$r(u_1, \ldots, u_t)$ is rather small when $(u_1, \ldots, u_t)$ is a "large" solution to (15). Hence
the discrepancy $D_N^{(t)}$ will be rather small when (15) has only "large" solutions, i.e.,
when the spectral test is passed. All that remains is to quantify these qualitative
statements by making careful calculations.

In the first place, let's consider the size of $g(u_1, \ldots, u_t)$. When $N = m$,
so that the sum (47) is over an entire period, we have $g(u_1, \ldots, u_t) = 0$ except
when $(u_1, \ldots, u_t)$ satisfies (15), so the discrepancy is bounded above in this case
by the sum of $r(u_1, \ldots, u_t)$ taken over all the nonzero solutions of (15). But
let's consider also what happens in a sum like (47) when $N$ is less than $m$ and
$q(u_1, \ldots, u_t)$ is not a multiple of $m$. We have

$$\frac{1}{N} \sum_{0 \le n < N} \omega^{x_n}
= \frac{1}{N} \sum_{0 \le n < N} \frac{1}{m} \sum_{0 \le k < m} \omega^{-nk} \sum_{0 \le j < m} \omega^{x_j + jk}
= \frac{1}{N} \sum_{0 \le k < m} \Bigl( \frac{1}{m} \sum_{0 \le n < N} \omega^{-nk} \Bigr) S_{k0}, \tag{49}$$

where

$$S_{kl} = \sum_{0 \le j < m} \omega^{x_{j+l} + jk}. \tag{50}$$

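Now $S_{kl} = \omega^{-lk} S_{k0}$, so $|S_{kl}| = |S_{k0}|$ for all $l$, and we can calculate this common
value by further exponential-summery:

$$|S_{k0}|^2 = \frac{1}{m} \sum_{0 \le l < m} |S_{kl}|^2$$
$$= \frac{1}{m} \sum_{0 \le l < m} \ \sum_{0 \le j < m} \omega^{x_{j+l} + jk} \sum_{0 \le i < m} \omega^{-x_{i+l} - ik}$$
$$= \frac{1}{m} \sum_{0 \le i, j < m} \omega^{(j-i)k} \sum_{0 \le l < m} \omega^{x_{j+l} - x_{i+l}}$$
$$= \frac{1}{m} \sum_{0 \le i < m} \ \sum_{i \le j < m+i} \omega^{(j-i)k} \sum_{0 \le l < m} \omega^{(a^{j-i} - 1) x_{i+l} + (a^{j-i} - 1) c/(a-1)}.$$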


Let $s$ be minimum such that $a^s \equiv 1$ (modulo $m$), and let

$$s' = (a^s - 1)c/(a - 1) \bmod m.$$

Then $s$ is a divisor of $m$, and $x_{n+js} \equiv x_n + js'$ (modulo $m$). The sum on $l$
vanishes unless $j - i$ is a multiple of $s$, so we find that

$$|S_{k0}|^2 = m \sum_{0 \le j < m/s} \omega^{jsk + js'}.$$


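We have $s' = q's$ where $q'$ is relatively prime to $m$ (cf. exercise 3.2.1-21), so it
turns out that

$$|S_{k0}| = \begin{cases} 0, & \text{if } k + q' \not\equiv 0 \ (\text{modulo } m/s); \\ m/\sqrt{s}, & \text{if } k + q' \equiv 0 \ (\text{modulo } m/s). \end{cases} \tag{51}$$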


Putting this information back into (49), and recalling the derivation of (45),
shows that

$$\left| \frac{1}{N} \sum_{0 \le n < N} \omega^{x_n} \right| \le \frac{m}{N \sqrt{s}} \sum_k r(k), \tag{52}$$

where the sum is over $0 \le k < m$ such that $k + q' \equiv 0$ (modulo $m/s$).
Exercise 25 now can be used to estimate the remaining sum, and we find that

$$\left| \frac{1}{N} \sum_{0 \le n < N} \omega^{x_n} \right| \le \frac{2}{\pi}\, \frac{\sqrt{s}}{N} \ln s + O\Bigl( \frac{m}{N \sqrt{s}} \Bigr). \tag{53}$$

The same bound can be used to estimate $\bigl| N^{-1} \sum_{0 \le n < N} \omega^{q x_n} \bigr|$ for any $q \not\equiv 0$
(modulo $m$), since the effect is to replace $m$ in this derivation by a divisor of $m$.
In fact, the upper bound gets even smaller when $q$ has a factor in common
with $m$, since $s$ and $m/\sqrt{s}$ generally become smaller. (See exercise 26.)
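
Formula (51) is easy to test numerically on a toy generator. The sketch below is an illustration only: the generator $X_{n+1} = (5X_n + 1) \bmod 16$ is an assumed example with full period 16, the variable names are hypothetical, and floating-point complex arithmetic is used, so the comparison is made up to rounding error.

```python
import cmath
from math import sqrt

m, a, c = 16, 5, 1                       # assumed toy generator with period length m
xs, x = [], 0
for _ in range(m):
    xs.append(x)
    x = (a * x + c) % m

s = next(e for e in range(1, m + 1) if pow(a, e, m) == 1)   # least s with a^s = 1 (mod m)
s_prime = (a ** s - 1) // (a - 1) * c % m                   # s' as defined in the text
q_prime = s_prime // s                                      # s' = q' s
omega = cmath.exp(2j * cmath.pi / m)

rows = []
for k in range(m):
    S_k0 = sum(omega ** (xs[j] + j * k) for j in range(m))  # definition (50) with l = 0
    predicted = m / sqrt(s) if (k + q_prime) % (m // s) == 0 else 0.0
    rows.append((k, abs(S_k0), predicted))

for k, actual, predicted in rows:
    print(k, round(actual, 6), predicted)
```

For this generator $s = 4$, $s' = 12$, and $q' = 3$, so the sum has magnitude $16/\sqrt{4} = 8$ exactly when $k \equiv 1$ (modulo 4) and vanishes otherwise.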

We have now proved that the $g(u_1, \ldots, u_t)$ part of our upper bound (44)
on the discrepancy is small, if $N$ is large enough and if $(u_1, \ldots, u_t)$ does not
satisfy the spectral test congruence (15). Exercise 27 proves that the $f(u_1, \ldots, u_t)$
part of our upper bound is small, when summed over all the nonzero vectors
$(u_1, \ldots, u_t)$ satisfying (15), provided that all such vectors are far away from
$(0, \ldots, 0)$. Putting these results together leads to the following theorem of
Niederreiter:

Theorem N. Let $\langle X_n \rangle$ be a linear congruential sequence $(X_0, a, c, m)$ of period
length $m$, and let $s$ be the least positive integer such that $a^s \equiv 1$ (modulo $m$).
Let $\nu_t$ be the $t$-dimensional accuracy of $\langle X_n \rangle$, as determined by the spectral test.
Then the $t$-dimensional discrepancy $D_N^{(t)}$ determined by the first $N$ values of
$\langle X_n \rangle$, as defined in (42), satisfies

$$D_N^{(t)} = O\Bigl( \frac{\sqrt{s}\, \log s\, (\log m)^t}{N} \Bigr) + O\Bigl( \frac{m (\log m)^t}{N \sqrt{s}} \Bigr) + O\bigl( (\log m)^t\, r_{\max} \bigr); \tag{54}$$

$$D_m^{(t)} = O\bigl( (\log m)^t\, r_{\max} \bigr). \tag{55}$$

Here $r_{\max}$ is the maximum value of the quantity $r(u_1, \ldots, u_t)$ defined in (46),
taken over all nonzero integer vectors $(u_1, \ldots, u_t)$ satisfying (15).

Proof. The first two $O$ terms in (54) come from vectors $(u_1, \ldots, u_t)$ in (44) that
do not satisfy (15), since exercise 25 proves that $f(u_1, \ldots, u_t)$ summed over all
$(u_1, \ldots, u_t)$ is $O(((2/\pi) \ln m)^t)$ and exercise 26 bounds each $g(u_1, \ldots, u_t)$. (These
terms are missing from (55) since $g(u_1, \ldots, u_t) = 0$ in that case.) The remaining
$O$ term in (54) and (55) comes from nonzero vectors $(u_1, \ldots, u_t)$ that do satisfy
(15), using the bound derived in exercise 27. (By examining this proof carefully,
we could replace each $O$ in these formulas by an explicit function of $t$.) ∎

Eq. (55) relates to the serial test in $t$ dimensions over the entire period,
while Eq. (54) gives us useful information about the distribution of the first $N$
generated values when $N$ is less than $m$, provided that $N$ is not too small.
Note that (54) will guarantee low discrepancy only when $s$ is sufficiently large;
otherwise the $m/\sqrt{s}$ term will dominate. If $m = p_1^{e_1}\ldots p_r^{e_r}$ and $\gcd(a-1, m) =
p_1^{f_1}\ldots p_r^{f_r}$, then $s$ equals $p_1^{e_1-f_1}\ldots p_r^{e_r-f_r}$ (cf. Lemma 3.2.1.2P); thus, the largest
values of $s$ correspond to high potency. In the common case $m = 2^e$ and $a \equiv 5$
(modulo 8), we have $s = \frac14 m$, so $D_N^{(t)}$ is $O\bigl(\sqrt{m}\,(\log m)^{t+1}/N\bigr) + O\bigl((\log m)^t r_{\max}\bigr)$.
It is not difficult to prove that $r_{\max} \le \sqrt2/\nu_t$ unless $\nu_t$ is very small (see
exercise 29); therefore Eq. (54) says in particular that the discrepancy will be
low in $t$ dimensions if the spectral test is passed and if $N$ is somewhat larger
than $\sqrt{m}\,(\log m)^{t+1}$.
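
The claim that $s = \frac14 m$ when $m = 2^e$ and $a \equiv 5$ (modulo 8) is easy to check numerically. The following sketch is only an illustration; the function name `multiplicative_order` and the particular multipliers are choices made here, not part of the text. It finds the least $s$ with $a^s \equiv 1$ (modulo $m$) by brute force.

```python
def multiplicative_order(a, m):
    """Least s >= 1 with a**s == 1 (mod m); assumes gcd(a, m) == 1 and m > 1."""
    s, power = 1, a % m
    while power != 1:
        power = (power * a) % m
        s += 1
    return s

m = 2 ** 12
# Three illustrative multipliers, each congruent to 5 modulo 8.
multipliers = (5, 13, 69069 % m)
orders = {a: multiplicative_order(a, m) for a in multipliers}
for a, s in orders.items():
    print(a, s, m // 4)
```

Each multiplier should report $s = m/4 = 1024$, in agreement with the formula $s = p_1^{e_1-f_1}\ldots p_r^{e_r-f_r}$ for $\gcd(a-1, m) = 4$.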

In a sense Theorem N is almost too strong, for the result in exercise 30 shows
that linear congruential sequences like those in lines 8, 19, and 23 of Table 1
have a discrepancy of order $(\log m)^2/m$ in two dimensions. The discrepancy
in this case is extremely small in spite of the fact that there are parallelogram-
shaped regions of area $\approx 1/\sqrt{m}$ containing no points $(U_n, U_{n+1})$. The fact
that discrepancy can change so drastically when the points are rotated warns us
that the serial test may not be as meaningful a measure of randomness as the
rotation-invariant spectral test.

G. Historical remarks. In 1959, while deriving upper bounds for the error in 
the evaluation of t-dimensional integrals by the Monte Carlo method, N. M. 
Korobov devised a way to rate the multiplier of a linear congruential sequence. 
His formula (which is rather complicated) is related to the spectral test since 
it is strongly influenced by “small” solutions to (15); but it is not quite the 
same. Korobov’s test has been the subject of an extensive literature, surveyed 
by Kuipers and Niederreiter in Uniform Distribution of Sequences (New York: 
Wiley, 1974), §2.5. 

The spectral test was originally formulated by R. R. Coveyou and R. D.
MacPherson [JACM 14 (1967), 100-119], who introduced it in an interesting
indirect way. Instead of working with the grid structure of successive points,
they considered random number generators as sources of $t$-dimensional “waves.”
The numbers $\sqrt{x_1^2 + \cdots + x_t^2}$ such that $x_1 + \cdots + a^{t-1}x_t \equiv 0$ (modulo $m$) in
their original treatment were the wave “frequencies,” or points in the “spectrum”
defined by the random number generator, with low-frequency waves being the
most damaging to randomness; hence the name spectral test. Coveyou and
MacPherson introduced a procedure analogous to Algorithm S for performing
their test, based on the principle of Lemma A. However, the original procedure
(which used matrices $UU^T$ and $VV^T$ instead of $U$ and $V$) dealt with extremely
large numbers; the idea of working directly with $U$ and $V$ was independently
suggested by F. Janssens and by U. Dieter.

Several other authors pointed out that the spectral test could be understood 
in far more concrete terms; by introducing the study of the grid and lattice
structures corresponding to linear congruential sequences, the fundamental limitations
on randomness became graphically clear. See G. Marsaglia, Proc. Nat. Acad. Sci. 
61 (1968), 25-28; W. W. Wood, J. Chem. Phys. 48 (1968), 427; R. R. Coveyou, 
Studies in Applied Math. 3 (1969), 70-112; W. A. Beyer, R. B. Roof, and D. 
Williamson, Math. Comp. 25 (1971), 345-360; G. Marsaglia and W. A. Beyer, 
Applications of Number Theory to Numerical Analysis, ed. by S. K. Zaremba 
(New York: Academic Press, 1972), 249-285, 361-370. 

Harald Niederreiter’s papers concerning the use of exponential sums to study 
the distribution of linear congruential sequences have appeared in Math. Comp. 
26 (1972), 793-795; 28 (1974), 1117-1132; 30 (1976), 571-597; Advances in Math. 
26 (1977), 99-181 [this is the most important paper of the series]; and Bull. Amer. 
Math. Soc. 84 (1978), 273-274, 957-1041 [this one summarizes the others and
contains an extensive bibliography]. 


EXERCISES 

1. [M10] To what does the spectral test reduce in one dimension? (In other words,
what happens when $t = 1$?)

2. [HM20] Let $V_1, \ldots, V_t$ be linearly independent vectors in $t$-space, let $L_0$ be the
lattice of points defined by (10), and let $U_1, \ldots, U_t$ be defined by (19). Prove that
the maximum distance between $(t-1)$-dimensional hyperplanes, over all families of
parallel hyperplanes that cover $L_0$, is $1/\min\{\, f(x_1,\ldots,x_t)^{1/2} \mid (x_1,\ldots,x_t) \ne (0,\ldots,0) \,\}$,
where $f$ is defined in (17).

3. [M24] Determine $\nu_3$ and $\nu_4$ for all linear congruential generators of potency 2 and
period length $m$.

► 4. [M23] Let $u_{11}, u_{12}, u_{21}, u_{22}$ be elements of a $2 \times 2$ integer matrix such that
$u_{11} + a u_{12} \equiv u_{21} + a u_{22} \equiv 0$ (modulo $m$) and $u_{11}u_{22} - u_{21}u_{12} = m$. (a) Prove
that all integer solutions $(y_1, y_2)$ to the congruence $y_1 + a y_2 \equiv 0$ (modulo $m$) have the
form $(y_1, y_2) = (x_1 u_{11} + x_2 u_{21},\; x_1 u_{12} + x_2 u_{22})$ for integer $x_1, x_2$. (b) If, in addition,
$2|u_{11}u_{21} + u_{12}u_{22}| \le u_{11}^2 + u_{12}^2 \le u_{21}^2 + u_{22}^2$, prove that $(y_1, y_2) = (u_{11}, u_{12})$ minimizes
$y_1^2 + y_2^2$ over all nonzero solutions to the congruence.

5. [M30] Prove that steps S1 through S3 of Algorithm S correctly perform the
spectral test in two dimensions. [Hint: See exercise 4, and prove that $(h' + h)^2 +
(p' + p)^2 \ge h^2 + p^2$ at the beginning of step S2.]

6. [M30] Let $a_0, a_1, \ldots, a_{t-1}$ be the partial quotients of $a/m$ as defined in Section
3.3.3, and let $A = \max_{0 \le j < t} a_j$. Prove that $\mu_2 > 2\pi/(A + 1 + 1/A)$.

7. [HM22] Prove that “question (a)” and “question (b)” of the text have the same
solution for real values of $q_1, \ldots, q_{j-1}, q_{j+1}, \ldots, q_t$ (cf. (24), (26)).

8. [M16] Line 18 of Table 1 has a very low value of $\mu_2$, yet $\mu_3$ is quite satisfactory.
What is the highest possible value of $\mu_3$ when $\mu_2 = 10^{-6}$ and $m = 10^{10}$?

9. [HM32] (C. Hermite, 1846.) Let $f(x_1,\ldots,x_t)$ be a positive definite quadratic
form, defined by the matrix $U$ as in (17), and let $\theta$ be the minimum value of $f$ at
nonzero integer points. Prove that $\theta \le (\frac43)^{(t-1)/2}\,|\det U|^{2/t}$. [Hints: If $W$ is any integer
matrix of determinant 1, the matrix $WU$ defines a form equivalent to $f$; and if $S$ is
any orthogonal matrix (i.e., $S^{-1} = S^T$), the matrix $US$ defines a form identically
equal to $f$. Show that there is an equivalent form $g$ whose minimum $\theta$ occurs at
$(1, 0, \ldots, 0)$. Then prove the general result by induction on $t$, writing $g(x_1,\ldots,x_t) =
\theta(x_1 + \beta_2 x_2 + \cdots + \beta_t x_t)^2 + h(x_2,\ldots,x_t)$ where $h$ is a positive definite quadratic form
in $t - 1$ variables.]

10. [M28] Let $(y_1, y_2)$ be relatively prime integers such that $y_1 + a y_2 \equiv 0$ (modulo $m$)
and $y_1^2 + y_2^2 < \sqrt{4/3}\,m$. Show that there exist integers $(u_1, u_2)$ such that $u_1 + a u_2 \equiv 0$
(modulo $m$), $u_1 y_2 - u_2 y_1 = m$, $2|u_1 y_1 + u_2 y_2| \le \min(u_1^2 + u_2^2,\; y_1^2 + y_2^2)$, and $(u_1^2 + u_2^2)
(y_1^2 + y_2^2) \ge m^2$. (Hence by exercise 4, $\nu_2^2 = \min(u_1^2 + u_2^2,\; y_1^2 + y_2^2)$.)

► 11. [HM30] (Alan G. Waterman, 1974.) Invent a reasonably efficient procedure that
computes multipliers $a \equiv 1$ (modulo 4) for which there exists a relatively prime solution
to the congruence $y_1 + a y_2 \equiv 0$ (modulo $m$) with $y_1^2 + y_2^2 = \sqrt{4/3}\,m - \epsilon$, where $\epsilon > 0$
is as small as possible, given $m = 2^e$. (By exercise 10, this choice of $a$ will guarantee
that $\nu_2^2 \ge m^2/(y_1^2 + y_2^2) > \sqrt{3/4}\,m$, and there is a chance that $\nu_2^2$ will be near its
optimum value $\sqrt{4/3}\,m$. In practice we will compute several such multipliers having
small $\epsilon$, choosing the one with best spectral values $\nu_2, \nu_3, \ldots$ .)

12. [HM23] Prove, without geometrical handwaving, that any solution to the text’s
“question (b)” must also satisfy the set of equations (26).

13. [HM22] Lemma A uses the fact that $U$ is nonsingular to prove that a positive
definite quadratic form attains a definite, nonzero minimum value at nonzero integer
points. Show that this hypothesis is necessary, by exhibiting a quadratic form (19)
whose matrix of coefficients is singular, and for which the values of $f(x_1,\ldots,x_t)$ get
arbitrarily near zero (but never reach it) at nonzero integer points $(x_1,\ldots,x_t)$.

14. [24] Perform Algorithm S by hand, for $m = 100$, $a = 41$, $T = 3$.

► 15. [M20] Let $U$ be an integer vector satisfying (15). How many of the $(t-1)$-
dimensional hyperplanes defined by $U$ intersect the unit hypercube $\{\,(x_1,\ldots,x_t) \mid
0 \le x_j < 1$ for $1 \le j \le t\,\}$? (This is approximately the number of hyperplanes in
the family that will suffice to cover $L_0$.)

16. [M30] (U. Dieter.) Show how to modify Algorithm S in order to calculate the
minimum number $N_t$ of parallel hyperplanes intersecting the unit hypercube as in
exercise 15, over all $U$ satisfying (15). [Hint: What are appropriate analogs to positive
definite quadratic forms and to Lemma A?]

17. [20] Modify Algorithm S so that, in addition to computing the quantities $\nu_t$, it
outputs all integer vectors $(u_1,\ldots,u_t)$ satisfying (15) such that $u_1^2 + \cdots + u_t^2 = \nu_t^2$,
for $2 \le t \le T$.

18. [M30] (a) Let $m = 2^e$, where $e$ is even. By considering “combinatorial matrices,”
i.e., matrices whose elements have the form $y + x\delta_{ij}$ (cf. exercise 1.2.3-39), find $3 \times 3$
matrices of integers $U$ and $V$ satisfying (29) such that the transformation of step S5
does nothing for any $j$, but the corresponding values of $z_k$ in (31) are so huge that
exhaustive search is out of the question. (The matrix $U$ need not satisfy (28); we
are interested here in arbitrary positive definite quadratic forms of determinant $m$.)
(b) Although transformation (23) is of no use for the matrices constructed in (a), find
another transformation that does produce a substantial reduction.

► 19. [HM25] Suppose step S5 were changed slightly, so that a transformation with
$q = 1$ would be performed when $2V_i \cdot V_j = V_j \cdot V_j$. (Thus, $q = \lfloor (V_i \cdot V_j / V_j \cdot V_j) + \frac12 \rfloor$
in all cases.) Would it be possible for Algorithm S to get into an infinite loop?

20. [M21] Discuss how to carry out an appropriate spectral test for linear congruential
sequences having $c = 0$, $X_0$ odd, $m = 2^e$, $a \bmod 8 = 5$.

21. [M20] (R. W. Gosper.) A certain application uses random numbers in batches of
four, but “throws away” the second of each set. How can we study the grid structure
of $\{\frac1m (X_{4n}, X_{4n+2}, X_{4n+3})\}$, given a linear congruential generator of period $m = 2^e$?

22. [M46] What is the best upper bound on $\mu_3$, given that $\mu_2$ is very near its
maximum value $\sqrt{4/3}\,\pi$? What is the best upper bound on $\mu_2$, given that $\mu_3$ is very near
its maximum value $\frac43\pi\sqrt2$?

23. [M46] Let $U_i$, $V_j$ be vectors of real numbers with $U_i \cdot V_j = \delta_{ij}$ for $1 \le i, j \le t$,
and such that $U_i \cdot U_i = 1$, $2|U_i \cdot U_j| \le 1$, $2|V_i \cdot V_j| \le V_j \cdot V_j$ for $i \ne j$. How large
can $V_1 \cdot V_1$ be? (This question relates to the bounds in step S8, if both (23) and the
transformation of exercise 18(b) fail to make any reductions. The maximum value
known to be achievable is $(n + 2)/3$, which occurs when $U_1 = I_1$, $U_j = \frac12 I_1 + \frac12\sqrt3\, I_j$,
$V_1 = I_1 - (I_2 + \cdots + I_n)/\sqrt3$, $V_j = 2I_j/\sqrt3$, for $2 \le j \le n$, where $(I_1,\ldots,I_n)$ is the
identity matrix; this construction is due to B. V. Alekseev [Matematicheskie Zametki,
to appear].)

► 24. [M28] Generalize the spectral test to second-order sequences of the form $X_n =
(aX_{n-1} + bX_{n-2}) \bmod p$, having period length $p^2 - 1$. (Cf. Eq. 3.2.2-8.) How should
Algorithm S be modified?

25. [HM24] Let $d$ be a divisor of $m$ and let $0 \le q < d$. Prove that $\sum r(k)$, summed
over all $0 \le k < m$ such that $k \bmod d = q$, is at most $(2/d\pi)\ln(m/d) + O(1)$. (Here
$r(k)$ is defined in Eq. (46) when $t = 1$.)

26. [M22] Explain why the derivation of (53) leads to a similar bound on

$$N^{-1} \sum_{0 \le n < N} \omega^{qX_n}$$

for $0 < q < m$. Where does the derivation of (53) break down when $m = 1$?

27. [HM39] (E. Hlawka, H. Niederreiter.) Let $r(u_1,\ldots,u_t)$ be the function defined
in (46). Prove that $\sum r(u_1,\ldots,u_t)$, summed over all $0 \le u_1,\ldots,u_t < m$ such
that $(u_1,\ldots,u_t) \ne (0,\ldots,0)$ and (15) holds, is $O\bigl((2\pi \lg m)^t r_{\max}\bigr)$, where $r_{\max}$ is the
maximum term $r(u_1,\ldots,u_t)$ in the sum.

► 28. [M28] (H. Niederreiter.) Find an analog of Theorem N for the case $m$ = prime,
$c = 0$, $a$ = primitive root modulo $m$, $X_0 \not\equiv 0$ (modulo $m$). [Hint: Your exponential
sums should involve $\zeta = e^{2\pi i/(m-1)}$ as well as $\omega$.] Prove that in this case the “average”
primitive root has discrepancy $D_{m-1}^{(t)} = O\bigl(t(\log m)^t/\varphi(m-1)\bigr)$, hence good primitive
roots exist for all $m$.

29. [M21] Prove that quantity $r_{\max}$ of exercise 27 is never larger than $\sqrt2/\nu_t$ unless
$\nu_t^2 < 2(t - 1)$.

30. [MSS] (S. K. Zaremba.) Prove that in two dimensions, r max < mf max(ai,..., o s ), 
where a\,..., a s are the partial quotients obtained when Euclid’s algorithm is applied to 
$m$ and $a$. [Hint: We have $a/m = /\!/a_1, \ldots, a_s/\!/$ in the notation of Section 4.5.3; apply
exercise 4.5.3-42.]

31. [HM47] (I. Borosh.) Prove that for all sufficiently large $m$ there exists a number
$a$ relatively prime to $m$ such that all partial quotients of $a/m$ are $\le 3$. Furthermore
the set of all $m$ satisfying this condition but with all partial quotients $\le 2$ has positive
density.


3.4. OTHER TYPES OF RANDOM QUANTITIES

We have now seen how to make a computer generate a sequence of numbers
$U_0, U_1, U_2, \ldots$ that behaves as if each number were independently selected at
random between zero and one with the uniform distribution. Applications of
random numbers often call for other kinds of distributions, however; for example,
if we want to make a random choice from among $k$ alternatives, we want a
random integer between 1 and $k$. If some simulation process calls for a random
waiting time between occurrences of independent events, a random number with
the “exponential distribution” is desired. Sometimes we don’t even want random
numbers; we want a random permutation (i.e., a random arrangement of $n$
objects) or a random combination (i.e., a random choice of $k$ objects from a
collection of $n$).

In principle, any of these other random quantities may be obtained from the
uniform deviates $U_0, U_1, U_2, \ldots$ . People have devised a number of important
“random tricks” that may be used to perform these manipulations efficiently on 
a computer, and a study of these techniques also gives some insight into the 
proper use of random numbers in any Monte Carlo application. 

It is conceivable that someday somebody will invent a random number 
generator that produces one of these other random quantities directly, instead 
of getting it indirectly via the uniform distribution. But except for the “random 
bit” generator described in Section 3.2.2, no direct methods have so far proved 
to be practical. 

The discussion in the following section assumes the existence of a random 
sequence of uniformly distributed real numbers between zero and one. A new 
uniform deviate $U$ is generated whenever we need it. These numbers are usually
represented in a computer word with the decimal point assumed at the left. 


3.4.1. Numerical Distributions 

This section summarizes the best techniques known for producing numbers from 
various important distributions. Many of the methods were originally suggested 
by John von Neumann in the early 1950s, and they have gradually been improved 
upon by other people, notably George Marsaglia, J. H. Ahrens, and U. Dieter. 

A. Random choices from a finite set. The simplest and most common type
of distribution required in practice is a random integer. An integer between 0
and 7 can be extracted from three bits of $U$ on a binary computer; in such a
case, these bits should be extracted from the most significant (left-hand) part
of the computer word, since the least significant bits produced by many random
number generators are not sufficiently random. (See the discussion in Section
3.2.1.1.)

In general, to get a random integer $X$ between 0 and $k - 1$, we can multiply
by $k$, and let $X = \lfloor kU \rfloor$. On MIX, we would write

        LDA  U
        MUL  K                                                    (1)

and after these two instructions have been executed the desired integer will
appear in register A. If a random integer between 1 and $k$ is desired, we add one
to this result. (The instruction “INCA 1” would follow (1).)
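
The same computation can be sketched in a high-level language. The fragment below is an illustration only, assuming floating point deviates from Python's standard `random` module rather than MIX words; the function name `random_integer` and the `min` guard against floating point rounding are additions made for this sketch.

```python
import math
import random

def random_integer(k, u):
    """Return floor(k*u), an integer in 0..k-1, for a uniform deviate 0 <= u < 1.

    The min() guard is an extra safety measure for this sketch: in floating
    point, k*u could conceivably round up to k when u is extremely close to 1.
    """
    return min(k - 1, math.floor(k * u))

rng = random.Random(1)
# Adding one gives a random integer between 1 and k, as with "INCA 1".
rolls = [random_integer(6, rng.random()) + 1 for _ in range(1000)]
print(rolls[:10])
```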

This method gives each integer with nearly equal probability. There is a 
slight error because the computer word size is finite (see exercise 2); but the 
error is quite negligible if k is small, for example if k/m < 1/10000. 

In a more general situation we might want to give different weights to
different integers. Suppose that the value $X = x_1$ is to be obtained with
probability $p_1$, and $X = x_2$ with probability $p_2$, ..., $X = x_k$ with probability
$p_k$. We can generate a uniform number $U$ and let

$$X = \begin{cases}
x_1, & \text{if } 0 \le U < p_1;\\
x_2, & \text{if } p_1 \le U < p_1 + p_2;\\
\;\vdots & \\
x_k, & \text{if } p_1 + p_2 + \cdots + p_{k-1} \le U < 1.
\end{cases} \tag{2}$$

(Note that $p_1 + p_2 + \cdots + p_k = 1$.)
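
Rule (2) amounts to comparing $U$ against successive partial sums. A minimal sketch follows, using a straightforward left-to-right scan (not the “best possible” comparison order); the name `choose_weighted` is hypothetical, and the final `return` is a guard added here in case rounding leaves the last partial sum slightly below 1.

```python
from itertools import accumulate

def choose_weighted(values, probs, u):
    """Return values[i] for the first i with u < probs[0] + ... + probs[i]."""
    for x, partial_sum in zip(values, accumulate(probs)):
        if u < partial_sum:
            return x
    return values[-1]  # guard: reached only if rounding made the total < 1

values = ["x1", "x2", "x3"]
probs = [0.2, 0.5, 0.3]
print([choose_weighted(values, probs, u) for u in (0.1, 0.2, 0.69, 0.95)])
```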

There is a “best possible” way to do the comparisons of $U$ against various
values of $p_1 + p_2 + \cdots + p_s$, as implied in (2); this situation is discussed in
Section 2.3.4.5. Special cases can be handled by more efficient methods; for
example, to obtain one of the eleven values 2, 3, ..., 12 with the respective “dice”
probabilities $\frac1{36}, \frac2{36}, \ldots, \frac6{36}, \ldots, \frac2{36}, \frac1{36}$, we could compute two independent
random integers between 1 and 6 and add them together.

However, none of the above techniques is really the fastest general method
for selecting $x_1, \ldots, x_k$ with the correct probabilities. An ingenious way to do
the trick has been discovered by A. J. Walker [Electronics Letters 10, 8 (1974),
127-128; ACM Trans. Math. Software 3 (1976), 253-256]. Suppose we form $kU$
and consider the integer part $K = \lfloor kU \rfloor$ and fraction part $V = (kU) \bmod 1$
separately; for example, after the code (1) we will have $K$ in register A and $V$
in register X. Then we can always obtain the desired distribution by doing the
operations

        if $V < P_K$ then $X \leftarrow x_{K+1}$ otherwise $X \leftarrow Y_K$,        (3)

for some appropriate tables $(P_0, \ldots, P_{k-1})$ and $(Y_0, \ldots, Y_{k-1})$. Exercise 7 shows
how such tables can be computed in general. Walker’s method is sometimes
called the method of “aliases.”

On a binary computer it is usually helpful to assume that $k$ is a power of 2,
so that multiplication can be replaced by shifting; this can be done without loss
of generality by introducing additional $x$’s that occur with probability zero. For
example, let’s consider dice again; suppose we want $X = j$ to occur with the
following 16 probabilities:

   j  =  0   1    2     3     4     5     6     7     8     9    10    11    12   13  14  15
  p_j =  0   0  1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36   0   0   0

We can do this using (3), if $k = 16$ and $x_{j+1} = j$ for $0 \le j < 16$, and if the $P$
and $Y$ tables are set up as follows:

  P_j =  0   0   4/9   8/9    1    7/9    1     1     1    7/9   7/9   8/9   4/9   0   0   0
  Y_j =  5   9    7     4     *     6     *     *     *     8     4     7    10    6   7   8

(When $P_j = 1$, $Y_j$ is not used.) For example, the value 7 occurs with probability
$\frac1{16} \cdot \bigl((1 - P_2) + P_7 + (1 - P_{11}) + (1 - P_{14})\bigr) = \frac6{36}$ as required. It is a peculiar
way to throw dice, but the results are indistinguishable from the real thing.
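
The dice example can be checked mechanically. The sketch below is an illustration, with the tables entered as exact fractions (a `None` entry marks an unused alias); the names `alias_sample` and `exact_distribution` are hypothetical. It applies rule (3) and also computes the exact probability of each outcome implied by the $P$ and $Y$ tables.

```python
import math
from fractions import Fraction as Fr

K = 16
P = [Fr(0), Fr(0), Fr(4, 9), Fr(8, 9), Fr(1), Fr(7, 9), Fr(1), Fr(1),
     Fr(1), Fr(7, 9), Fr(7, 9), Fr(8, 9), Fr(4, 9), Fr(0), Fr(0), Fr(0)]
Y = [5, 9, 7, 4, None, 6, None, None, None, 8, 4, 7, 10, 6, 7, 8]

def alias_sample(u):
    """Rule (3) with x_{j+1} = j: split k*u into integer part j and fraction v."""
    ku = K * u
    j = math.floor(ku)
    v = ku - j
    return j if v < P[j] else Y[j]

def exact_distribution():
    """Probability of each outcome 0..15 implied by the tables."""
    prob = [Fr(0)] * K
    for j in range(K):
        prob[j] += P[j] / K                 # outcome j itself is chosen
        if P[j] < 1:
            prob[Y[j]] += (1 - P[j]) / K    # the alias Y_j is chosen instead
    return prob

dist = exact_distribution()
print([str(p) for p in dist])
```

The exact distribution should agree with the usual two-dice probabilities $(6 - |7 - j|)/36$ for $2 \le j \le 12$, and be zero elsewhere.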

B. General methods for continuous distributions. The most general real-valued
distribution may be expressed in terms of its “distribution function” $F(x)$, which
specifies the probability that a random quantity $X$ will not exceed $x$:

$$F(x) = \text{probability that } (X \le x). \tag{4}$$

This function always increases monotonically from zero to one; i.e.,

$$F(x_1) \le F(x_2), \quad \text{if } x_1 \le x_2; \qquad F(-\infty) = 0, \quad F(+\infty) = 1. \tag{5}$$

Examples of distribution functions are given in Section 3.3.1, Fig. 3. If $F(x)$
is continuous and strictly increasing (so that $F(x_1) < F(x_2)$ when $x_1 < x_2$), it
takes on all values between zero and one, and there is an inverse function $F^{-1}(y)$
such that, for $0 < y < 1$,

$$y = F(x) \quad \text{if and only if} \quad x = F^{-1}(y). \tag{6}$$

In general we can compute a random quantity $X$ with the continuous, strictly
increasing distribution $F(x)$ by setting

$$X = F^{-1}(U), \tag{7}$$

where $U$ is uniform; this works because the probability that $X \le x$ is the
probability that $F^{-1}(U) \le x$, i.e., the probability that $U \le F(x)$, and this is $F(x)$.
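
As a concrete instance of (7), consider the distribution $F(x) = 1 - e^{-x/\mu}$ for $x \ge 0$, whose inverse is $F^{-1}(y) = -\mu\ln(1 - y)$. This particular $F$ is chosen here only because its inverse has a closed form; the function name below is hypothetical.

```python
import math
import random

def exponential_deviate(u, mean=1.0):
    """Invert F(x) = 1 - exp(-x/mean): X = -mean * ln(1 - u), for 0 <= u < 1."""
    return -mean * math.log(1.0 - u)

rng = random.Random(3)
sample = [exponential_deviate(rng.random()) for _ in range(100_000)]
sample_mean = sum(sample) / len(sample)
# Empirical check of (4): the fraction of values <= 1 should be near F(1).
fraction_below_1 = sum(x <= 1.0 for x in sample) / len(sample)
print(sample_mean, fraction_below_1, 1.0 - math.exp(-1.0))
```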

The problem now reduces to one of numerical analysis, namely to find good 
methods for evaluating $F^{-1}(U)$ to the desired accuracy. Numerical analysis
lies outside the scope of this seminumerical book; yet there are a number of 
important shortcuts available to speed up this general approach, and we will 
consider them here. 

In the first place, if $X_1$ is a random variable having the distribution $F_1(x)$
and if $X_2$ is an independent random variable with the distribution $F_2(x)$, then

$$\begin{aligned}
\max(X_1, X_2) &\text{ has the distribution } F_1(x)F_2(x),\\
\min(X_1, X_2) &\text{ has the distribution } F_1(x) + F_2(x) - F_1(x)F_2(x).
\end{aligned} \tag{8}$$

(See exercise 4.) For example, a uniform deviate $U$ has the distribution $F(x) = x$,
for $0 \le x \le 1$; if $U_1, U_2, \ldots, U_t$ are independent uniform deviates, then
$\max(U_1, U_2, \ldots, U_t)$ has the distribution function $F(x) = x^t$, for $0 \le x \le 1$.
This is the basis of the “maximum-of-t test” given in Section 3.3.2. Note that
the inverse function in this case is $F^{-1}(y) = \sqrt[t]{y}$. In the special case $t = 2$, we
see therefore that the two formulas

$$X = \sqrt{U} \qquad \text{and} \qquad X = \max(U_1, U_2) \tag{9}$$

will give equivalent distributions to the random variable $X$, although this is not
obvious at first glance. We need not take the square root of a uniform deviate.
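
The equivalence in (9) can be observed empirically. The following sketch (an illustration with arbitrary sample size and seeds, using Python's standard `random` module) estimates the distribution function of both formulas at a few points and compares it with $F(x) = x^2$.

```python
import math
import random

def empirical_cdf(sample, x):
    """Fraction of the sample that does not exceed x."""
    return sum(v <= x for v in sample) / len(sample)

n = 100_000
rng1, rng2 = random.Random(1), random.Random(2)
by_sqrt = [math.sqrt(rng1.random()) for _ in range(n)]
by_max = [max(rng2.random(), rng2.random()) for _ in range(n)]

for x in (0.25, 0.5, 0.75):
    print(x, x * x, empirical_cdf(by_sqrt, x), empirical_cdf(by_max, x))
```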

The number of tricks like this is endless: any algorithm that employs random 
numbers as input will give a random quantity with some distribution as output. 
The problem is to find general methods for constructing the algorithm, given the 
distribution function of the output. Instead of discussing such methods in purely 
abstract terms, we shall study how they can be applied in important cases. 

C. The normal distribution. Perhaps the most important nonuniform, continuous
distribution is the so-called normal distribution with mean zero and standard
deviation one:

$$F(x) = \sqrt{\frac{1}{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt. \tag{10}$$

The significance of this distribution was indicated in Section 1.2.10. Note that
the inverse function $F^{-1}$ is not especially easy to compute; but we shall see that
several other techniques are available.

(1) The polar method, due to G. E. P. Box, M. E. Muller, and G. Marsaglia.
(See Annals Math. Stat. 28 (1958), 610; and Boeing Scientific Res. Lab. report
D1-82-0203 (1962).)

Algorithm P (Polar method for normal deviates). This algorithm calculates two
independent normally distributed variables, $X_1$ and $X_2$.

P1. [Get uniform variables.] Generate two independent random variables, $U_1$,
    $U_2$, uniformly distributed between zero and one. Set $V_1 \leftarrow 2U_1 - 1$, $V_2 \leftarrow
    2U_2 - 1$. (Now $V_1$ and $V_2$ are uniformly distributed between $-1$ and $+1$.
    On most computers it will be preferable to have $V_1$ and $V_2$ represented in
    floating point form at this point.)

P2. [Compute S.] Set $S \leftarrow V_1^2 + V_2^2$.

P3. [Is $S \ge 1$?] If $S \ge 1$, return to step P1. (Steps P1 through P3 are executed
    1.27 times on the average, with a standard deviation of 0.587; see exercise 6.)

P4. [Compute $X_1$, $X_2$.] Set $X_1$, $X_2$ according to the following two equations:

$$X_1 = V_1 \sqrt{\frac{-2\ln S}{S}}, \qquad X_2 = V_2 \sqrt{\frac{-2\ln S}{S}}.$$

    These are the normally distributed variables desired. |
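
Steps P1 through P4 translate almost line for line into a short program. The sketch below is one possible transcription, assuming Python's standard `random` module as the source of uniform deviates; the function name is hypothetical, and the extra test for $S = 0$ is a guard added here so that the logarithm is always defined.

```python
import math
import random

def polar_normal_pair(rng):
    """Return two independent normal deviates by the polar method."""
    while True:
        v1 = 2.0 * rng.random() - 1.0      # P1: uniform between -1 and +1
        v2 = 2.0 * rng.random() - 1.0
        s = v1 * v1 + v2 * v2              # P2
        if 0.0 < s < 1.0:                  # P3 (also rejecting s == 0)
            break
    factor = math.sqrt(-2.0 * math.log(s) / s)   # P4
    return v1 * factor, v2 * factor

rng = random.Random(4)
values = [x for _ in range(50_000) for x in polar_normal_pair(rng)]
mean = sum(values) / len(values)
variance = sum((x - mean) ** 2 for x in values) / len(values)
print(mean, variance)
```

The sample mean and variance should come out close to 0 and 1 respectively.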
To prove the validity of this method, we use elementary analytic geometry
and calculus: If $S < 1$ in step P3, the point in the plane with Cartesian
coordinates $(V_1, V_2)$ is a random point uniformly distributed inside the unit circle.
Transforming to polar coordinates $V_1 = R\cos\Theta$, $V_2 = R\sin\Theta$, we find $S = R^2$,
$X_1 = \sqrt{-2\ln S}\,\cos\Theta$, $X_2 = \sqrt{-2\ln S}\,\sin\Theta$. Using also the polar coordinates
$X_1 = R'\cos\Theta'$, $X_2 = R'\sin\Theta'$, we find that $\Theta' = \Theta$ and $R' = \sqrt{-2\ln S}$.
It is clear that $R'$ and $\Theta'$ are independent, since $R$ and $\Theta$ are independent
inside the unit circle. Also, $\Theta'$ is uniformly distributed between 0 and $2\pi$; and
the probability that $R' \le r$ is the probability that $-2\ln S \le r^2$, i.e., the
probability that $S \ge e^{-r^2/2}$. This equals $1 - e^{-r^2/2}$, since $S = R^2$ is uniformly
distributed between zero and one. The probability that $R'$ lies between $r$ and
$r + dr$ is therefore the derivative of $1 - e^{-r^2/2}$, namely, $re^{-r^2/2}\,dr$. Similarly, the
probability that $\Theta'$ lies between $\theta$ and $\theta + d\theta$ is $(1/2\pi)\,d\theta$. The joint probability
that $X_1 \le x_1$ and that $X_2 \le x_2$ now can be computed; it is

$$\begin{aligned}
\int_{\{(r,\theta)\,\mid\, r\cos\theta \le x_1,\; r\sin\theta \le x_2\}} \frac{1}{2\pi}\, e^{-r^2/2}\, r\,dr\,d\theta
&= \frac{1}{2\pi} \int_{\{(x,y)\,\mid\, x \le x_1,\; y \le x_2\}} e^{-(x^2+y^2)/2}\,dx\,dy\\
&= \left(\sqrt{\frac{1}{2\pi}} \int_{-\infty}^{x_1} e^{-x^2/2}\,dx\right) \left(\sqrt{\frac{1}{2\pi}} \int_{-\infty}^{x_2} e^{-y^2/2}\,dy\right).
\end{aligned}$$

This proves that $X_1$ and $X_2$ are independent and normally distributed, as
desired.

(2) The rectangle-wedge-tail method, introduced by G. Marsaglia. In this method
we use the distribution

$$F(x) = \sqrt{\frac{2}{\pi}} \int_0^x e^{-t^2/2}\,dt, \qquad x \ge 0, \tag{12}$$

so that $F(x)$ gives the distribution of the absolute value of a normal deviate.
After $X$ has been computed according to distribution (12), we will attach a
random sign to its value, and this will make it a true normal deviate.

The rectangle-wedge-tail approach is based on several important general
techniques that we shall explore as we develop the algorithm. The first key idea
is to regard $F(x)$ as a mixture of several other functions, namely to write

$$F(x) = p_1F_1(x) + p_2F_2(x) + \cdots + p_nF_n(x), \tag{13}$$

where $F_1, F_2, \ldots, F_n$ are appropriate distributions and $p_1, p_2, \ldots, p_n$ are
nonnegative probabilities that sum to 1. If we generate a random variable $X$ by
choosing distribution $F_j$ with probability $p_j$, it is easy to see that $X$ will have
distribution $F$ overall. Some of the distributions $F_j(x)$ may be rather difficult to
handle, even harder than $F$ itself, but we can usually arrange things so that the
probability $p_j$ is very small in this case. Most of the distributions $F_j(x)$ will be
quite easy to accommodate, since they will be trivial modifications of the uniform
distribution. The resulting method yields an extremely efficient program, since
its average running time is very small.

Fig. 9. The density function divided into 31 parts. The area of each part represents
the average number of times a random number with that density is to be computed.

It is easier to understand the method if we work with the derivatives of the
distributions instead of the distributions themselves. Let

$$f(x) = F'(x), \qquad f_j(x) = F_j'(x);$$

these are called the density functions of the probability distributions. Equation
(13) becomes

$$f(x) = p_1f_1(x) + p_2f_2(x) + \cdots + p_nf_n(x). \tag{14}$$

Each $f_j(x)$ is $\ge 0$, and the total area under the graph of $f_j(x)$ is 1; so there is
a convenient graphical way to display the relation (14): The area under $f(x)$ is
divided into $n$ parts, with the part corresponding to $f_j(x)$ having area $p_j$. See
Fig. 9, which illustrates the situation in the case of interest to us here, with
$f(x) = F'(x) = \sqrt{2/\pi}\,e^{-x^2/2}$; the area under this curve has been divided into
$n = 31$ parts. There are 15 rectangles, which represent $p_1f_1(x), \ldots, p_{15}f_{15}(x)$;
there are 15 wedge-shaped pieces, which represent $p_{16}f_{16}(x), \ldots, p_{30}f_{30}(x)$; and
the remaining part $p_{31}f_{31}(x)$ is the “tail,” namely the entire graph of $f(x)$ for
$x \ge 3$.

The rectangular parts $f_1(x), \ldots, f_{15}(x)$ represent uniform distributions. For
example, $f_3(x)$ represents a random variable uniformly distributed between $\frac25$
and $\frac35$. The altitude of $p_jf_j(x)$ is $f(j/5)$, hence the area of the $j$th rectangle is

$$p_j = \tfrac15 f(j/5) = \sqrt{\frac{2}{25\pi}}\; e^{-j^2/50}, \qquad \text{for } 1 \le j \le 15. \tag{15}$$
Fig. 10. Density functions for which Algorithm L may be used to generate random
numbers.


In order to generate such rectangular portions of the distribution, we simply
compute

$$X = \tfrac15 U + S, \tag{16}$$

where $U$ is uniform and $S$ takes the value $(j - 1)/5$ with probability $p_j$. Since
$p_1 + \cdots + p_{15} = .9183$, we can use simple uniform deviates like this about 92
percent of the time.
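
The figure of 92 percent follows directly from the areas of the 15 rectangles, each of width $\frac15$ and altitude $f(j/5)$ with $f(x) = \sqrt{2/\pi}\,e^{-x^2/2}$. The sketch below recomputes the sum and shows formula (16) for a single rectangle; the function names are hypothetical.

```python
import math

def f(x):
    """Density of the absolute value of a normal deviate."""
    return math.sqrt(2.0 / math.pi) * math.exp(-x * x / 2.0)

# Area of the j-th rectangle: width 1/5 times altitude f(j/5), 1 <= j <= 15.
p = [f(j / 5.0) / 5.0 for j in range(1, 16)]
total = sum(p)
print(round(total, 4))

def rectangle_deviate(j, u):
    """Formula (16) for the j-th rectangle: X = U/5 + (j - 1)/5."""
    return u / 5.0 + (j - 1) / 5.0

# A deviate from the third rectangle lies between 2/5 and 3/5.
print(rectangle_deviate(3, 0.5))
```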

In the remaining 8 percent, we will usually have to generate one of the
wedge-shaped distributions $F_{16}, \ldots, F_{30}$. Typical examples of what we need to
do are shown in Fig. 10. When $x < 1$, the curved part is concave downward, and
when $x > 1$ it is concave upward, but in each case the curved part is reasonably
close to a straight line, and it can be enclosed in two parallel lines as shown.

To handle these wedge-shaped distributions, we will rely on yet another 
general technique, von Neumann’s rejection method for obtaining a complicated 
density from another one that “encloses” it. The polar method described above is 
a simple example of such an approach: Steps P1-P3 obtain a random point inside 
the unit circle by first generating a random point in a larger square, rejecting it 
and starting over again if the point was outside the circle. 

The general rejection method is even more powerful than this. To generate a
random variable $X$ with density $f$, let $g$ be another probability density function
such that

$$f(t) \le cg(t) \tag{17}$$

for all $t$, where $c$ is a constant. Now generate $X$ according to density $g$, and also
generate an independent uniform deviate $U$. If $U \ge f(X)/cg(X)$, reject $X$ and
start again with another $X$ and $U$. When the condition $U < f(X)/cg(X)$ finally
occurs, the resulting $X$ will have density $f$ as desired. [Proof: $X \le x$ will occur
with probability $p(x) = \int_{-\infty}^{x} \bigl(g(t)\,dt \cdot f(t)/cg(t)\bigr) + qp(x)$, where the quantity
$q = \int_{-\infty}^{\infty} \bigl(g(t)\,dt \cdot (1 - f(t)/cg(t))\bigr) = 1 - 1/c$ is the probability of rejection;
hence $p(x) = \int_{-\infty}^{x} f(t)\,dt$.]
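
A small experiment makes the method concrete. In the sketch below, which is only an illustration, the target density is $f(x) = 6x(1 - x)$ on the unit interval, the enclosing density $g$ is uniform, and $c = 1.5$ (the maximum of $f$); these choices, and the function name `rejection_sample`, are assumptions made for this example.

```python
import random

def rejection_sample(f, draw_g, g, c, rng):
    """Return (X, number of iterations), where X has density f.

    draw_g(rng) must produce a deviate with density g, and f(t) <= c*g(t)
    must hold for all t.
    """
    iterations = 0
    while True:
        iterations += 1
        x = draw_g(rng)
        u = rng.random()
        if u < f(x) / (c * g(x)):     # accept; otherwise reject and retry
            return x, iterations

def f(x):
    return 6.0 * x * (1.0 - x)

rng = random.Random(7)
draws = [rejection_sample(f, lambda r: r.random(), lambda x: 1.0, 1.5, rng)
         for _ in range(50_000)]
mean_x = sum(x for x, _ in draws) / len(draws)
mean_iterations = sum(k for _, k in draws) / len(draws)
print(mean_x, mean_iterations)
```

The sample mean should be near $\frac12$, the mean of this density, and the average number of iterations should be near $c = 1.5$.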

The rejection technique is most efficient when $c$ is small, since there will be
$c$ iterations on the average before a value is accepted. (See exercise 6.) In some
cases $f(x)/cg(x)$ is always 0 or 1; then $U$ need not be generated. In other cases
if $f(x)/cg(x)$ is hard to compute, we may be able to “squeeze” it between two
bounding functions

$$r(x) \le f(x)/cg(x) \le s(x) \tag{18}$$

that are much simpler, and the exact value of $f(x)/cg(x)$ need not be calculated
unless $r(x) \le U < s(x)$. The following algorithm solves the wedge problem by
developing the rejection method still further.

Fig. 11. Region of “acceptance” in Algorithm L.

Algorithm L (Nearly linear densities). This algorithm may be used to generate a
random variable $X$ for any distribution whose density $f(x)$ satisfies the following
conditions (cf. Fig. 10):

$$\begin{aligned}
&f(x) = 0, && \text{for } x < s \text{ and for } x > s + h;\\
&a - b(x - s)/h \le f(x) \le b - b(x - s)/h, && \text{for } s \le x \le s + h.
\end{aligned} \tag{19}$$

L1. [Get $U \le V$.] Generate two independent random variables $U$, $V$, uniformly
    distributed between zero and one. If $U > V$, exchange $U \leftrightarrow V$.

L2. [Easy case?] If $V \le a/b$, go to L4.

L3. [Try again?] If $V > U + (1/b)f(s + hU)$, go back to step L1. (If $a/b$ is
    close to 1, this step of the algorithm will not be necessary very often.)

L4. [Compute X.] Set $X \leftarrow s + hU$. |

When step L4 is reached, the point $(U, V)$ is a random point in the area
shaded in Fig. 11, namely, $0 \le U \le V \le U + (1/b)f(s + hU)$. Conditions (19)
ensure that

$$\frac{a}{b} \le U + \frac{1}{b} f(s + hU) \le 1.$$

Now the probability that $X \le s + hx$, for $0 \le x \le 1$, is the ratio of area to
the left of the vertical line $U = x$ in Fig. 11 to the total area, namely,

$$\int_0^x \frac1b f(s + hu)\,du \bigg/ \int_0^1 \frac1b f(s + hu)\,du = \int_s^{s+hx} f(v)\,dv;$$

therefore $X$ has the correct distribution.
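
Algorithm L can be tried out on any density that satisfies (19). The sketch below is an illustration only: it uses the density $f(x) = \frac32(1 - x^2)$ on $0 \le x \le 1$, for which $s = 0$, $h = 1$, $a = 1.5$, and $b = 3$ satisfy (19); this example density and the function name `algorithm_L` are assumptions made here, not the wedges of Fig. 9.

```python
import random

def algorithm_L(f, a, b, s, h, rng):
    """Generate X with density f, where f satisfies conditions (19)."""
    while True:
        u, v = rng.random(), rng.random()    # L1: two uniform deviates
        if u > v:
            u, v = v, u                      #     arranged so that u <= v
        if v <= a / b:                       # L2: easy case, accept at once
            break
        if v <= u + f(s + h * u) / b:        # L3: accept; otherwise try again
            break
    return s + h * u                         # L4

def f(x):
    return 1.5 * (1.0 - x * x)

rng = random.Random(11)
sample = [algorithm_L(f, 1.5, 3.0, 0.0, 1.0, rng) for _ in range(100_000)]
mean = sum(sample) / len(sample)
below_half = sum(x <= 0.5 for x in sample) / len(sample)
print(mean, below_half)
```

For this density the mean is $\frac38$ and the probability that $X \le \frac12$ is $\frac{11}{16} = 0.6875$, so the two printed values should be close to those.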
Fig. 12. The “rectangle-wedge-tail” algorithm for generating normal deviates.


With appropriate constants $a_j$, $b_j$, $s_j$, Algorithm L will take care of the
wedge-shaped densities $f_{j+15}$ of Fig. 9, for $1 \le j \le 15$. The final distribution,
$F_{31}$, needs to be treated only about one time in 370; it is used whenever a result
$X \ge 3$ is to be computed. Exercise 11 shows that a standard rejection scheme
can be used for this “tail”; hence we are ready to consider the procedure in its
entirety:

Algorithm M (Rectangle-wedge-tail method for normal deviates). This algorithm
uses a number of auxiliary tables $(P_0, \ldots, P_{31})$, $(Q_1, \ldots, Q_{15})$, $(Y_0, \ldots, Y_{31})$,
$(Z_0, \ldots, Z_{31})$, $(S_1, \ldots, S_{16})$, $(D_{16}, \ldots, D_{30})$, $(E_{16}, \ldots, E_{30})$, constructed as
explained in exercise 10; examples appear in Table 1. We assume that a binary
computer is being used; a similar procedure could be worked out for decimal
machines.

M1. [Get U.] Generate a uniform random number $U = (.b_0 b_1 b_2 \ldots b_t)_2$. (Here the $b$'s are the bits in the binary representation of $U$. For reasonable accuracy, $t$ should be at least 24.) Set $\psi \leftarrow b_0$. (Later, $\psi$ will be used to determine the sign of the result.)

M2. [Rectangle?] Set $j \leftarrow (b_1 b_2 b_3 b_4 b_5)_2$, a binary number determined by the leading bits of $U$, and set $f \leftarrow (.b_6 b_7 \ldots b_t)_2$, the fraction determined by the remaining bits. If $f \ge P_j$, set $X \leftarrow Y_j + fZ_j$ and go to M9. Otherwise if $j \le 15$ (i.e., $b_1 = 0$), set $X \leftarrow S_j + fQ_j$ and go to M9. (This is an adaptation of Walker's alias method (3).)

Table 1


EXAMPLE OF TABLES USED WITH ALGORITHM M*

  j    P_j    P_{j+16}   Q_j     Y_j     Y_{j+16}   Z_j     Z_{j+16}   S_{j+1}   D_{j+15}   E_{j+15}
  0    .000   .067               0.00     0.59      0.20    0.21       0.0
  1    .849   .161       .236   -0.92     0.96      1.32    0.24       0.2       .505       25.00
  2    .970   .236       .206   -5.86    -0.06      6.66    0.26       0.4       .773       12.50
  3    .855   .285       .234   -0.58     0.12      1.38    0.28       0.6       .876        8.33
  4    .994   .308       .201  -33.13     1.31     34.93    0.29       0.8       .939        6.25
  5    .995   .304       .201  -39.55     0.31     41.35    0.29       1.0       .986        5.00
  6    .933   .280       .214   -2.57     1.12      2.97    0.28       1.2       .995        4.06
  7    .923   .241       .217   -1.61     0.54      2.61    0.26       1.4       .987        3.37
  8    .727   .197       .275    0.67     0.75      0.73    0.25       1.6       .979        2.86
  9   1.000   .152       .200    0.00     0.56      0.00    0.24       1.8       .972        2.47
 10    .691   .112       .289    0.35     0.17      0.65    0.23       2.0       .966        2.16
 11    .454   .079       .440   -0.17     0.38      0.37    0.22       2.2       .960        1.92
 12    .287   .052       .698    0.92    -0.01      0.28    0.21       2.4       .954        1.71
 13    .174   .033      1.150    0.36     0.39      0.24    0.21       2.6       .948        1.54
 14    .101   .020      1.974   -0.02     0.20      0.22    0.20       2.8       .942        1.40
 15    .057   .086      3.526    0.19     0.78      0.21    0.22       3.0       .936        1.27

*In practice, this data would be given with much greater precision; the table shows only enough 
figures so that interested readers may test their own algorithms for computing the values more 
accurately. 


M3. [Wedge or tail?] (Now $15 < j \le 31$, and each particular value $j$ occurs with probability $p_j$.) If $j = 31$, go to M7.

M4. [Get $U \le V$.] Generate two new uniform deviates, $U$, $V$; if $U > V$, exchange $U \leftrightarrow V$. (We are now performing Algorithm L.) Set $X \leftarrow S_{j-15} + \frac{1}{5}U$.

M5. [Easy case?] If $V \le D_j$, go to M9.

M6. [Another try?] If $V > U + E_j\bigl(e^{(S_{j-14}^2 - X^2)/2} - 1\bigr)$, go back to step M4; otherwise go to M9. (This step is executed with low probability.)

M7. [Get supertail deviate.] Generate two new independent uniform deviates, $U$, $V$, and set $X \leftarrow \sqrt{9 - 2\ln V}$.

M8. [Reject?] If $UX \ge 3$, go back to step M7. (This will occur only about one-twelfth as often as we reach step M8.)

M9. [Attach sign.] If $\psi = 1$, set $X \leftarrow -X$. ∎
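Steps M7 and M8 are self-contained and need no tables, so they can be illustrated in isolation. The following Python sketch is an assumption-laden illustration rather than part of the algorithm as stated: the function name `normal_tail_deviate` and the `rng` parameter are hypothetical, and $V$ is drawn from $(0, 1]$ so that its logarithm is always defined.

```python
import math
import random

def normal_tail_deviate(rng=random):
    """Sketch of steps M7-M8: a normal deviate conditioned on X >= 3."""
    while True:
        u = rng.random()
        v = 1.0 - rng.random()          # lies in (0, 1], so log(v) is defined
        x = math.sqrt(9.0 - 2.0 * math.log(v))   # M7: candidate with x >= 3
        if u * x < 3.0:                 # M8: reject when U*X >= 3
            return x
```

The candidate has a density proportional to $x e^{-x^2/2}$ for $x \ge 3$, and accepting with probability $3/x$ leaves a density proportional to $e^{-x^2/2}$, which is the rejection scheme that exercise 11 asks the reader to justify.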

This algorithm is a very pretty example of mathematical theory intimately
interwoven with programming ingenuity, a fine illustration of the art of computer
programming! Only steps M1, M2, and M9 need to be performed most of
the time, and the other steps aren't terribly slow either. The first publications
of the rectangle-wedge-tail method were by G. Marsaglia, Annals Math. Stat.
32 (1961), 894-899; G. Marsaglia, M. D. MacLaren, and T. A. Bray, CACM 7
(1964), 4-10. Further refinements of Algorithm M have been developed by G.
Marsaglia, K. Ananthanarayanan, and N. J. Paul, Inf. Proc. Letters 5 (1976),
27-30.

(3) The odd-even method, due to G. E. Forsythe. An amazingly simple technique
for generating random deviates with a density of the general exponential form

$$f(x) = Ce^{-h(x)} \text{ for } a \le x < b, \qquad f(x) = 0 \text{ otherwise}, \tag{20}$$

when

$$0 \le h(x) \le 1 \qquad \text{for } a \le x < b, \tag{21}$$

was discovered by John von Neumann and G. E. Forsythe about 1950. The idea
is based on the rejection method described earlier, letting $g(x)$ be the uniform
distribution on $[a, b)$: We set $X \leftarrow a + (b - a)U$, where $U$ is a uniform deviate, and
then we want to accept $X$ with probability $e^{-h(X)}$. The latter operation could
be done by testing $e^{-h(X)}$ vs. $V$, or $h(X)$ vs. $-\ln V$, when $V$ is another uniform
deviate, but the job can be done without applying any transcendental functions
in the following interesting way. Set $V_0 \leftarrow h(X)$, then generate uniform deviates
$V_1$, $V_2$, ... until finding some $K \ge 1$ with $V_{K-1} < V_K$. For fixed $X$ and $k$,
the probability that $h(X) \ge V_1 \ge \cdots \ge V_k$ is $1/k!$ times the probability that
$\max(V_1, \ldots, V_k) \le h(X)$, namely $h(X)^k/k!$; hence the probability that $K = k$
is $h(X)^{k-1}/(k-1)! - h(X)^k/k!$, and the probability that $K$ is odd is

$$\sum_{k \text{ odd},\; k \ge 1} \left( \frac{h(X)^{k-1}}{(k-1)!} - \frac{h(X)^k}{k!} \right) = e^{-h(X)}. \tag{22}$$

Therefore we reject $X$ and try again if $K$ is even; we accept $X$ as a random
variable with density (20) if $K$ is odd. Note that we usually won't have to generate
many $V$'s in order to determine $K$, since the average value of $K$ (given $X$)
is $\sum_{k \ge 0} \Pr(K > k) = \sum_{k \ge 0} h(X)^k/k! = e^{h(X)} \le e$.
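The odd-even acceptance rule is short enough to sketch directly. The Python below is an illustration only; the function name `odd_even_deviate`, its parameters, and the usage example with $h(x) = x$ on $[0, 1)$ are assumptions introduced here and are not taken from the text.

```python
import random

def odd_even_deviate(h, a, b, rng=random):
    """Sketch of the odd-even method for a density proportional to
    exp(-h(x)) on [a, b), assuming 0 <= h(x) <= 1 there."""
    while True:
        x = a + (b - a) * rng.random()
        prev = h(x)            # V_0 = h(X)
        k = 1
        while True:
            v = rng.random()   # V_k
            if prev < v:       # first K with V_{K-1} < V_K
                break
            prev = v
            k += 1
        if k % 2 == 1:         # accept X when K is odd
            return x
```

For instance, `odd_even_deviate(lambda x: x, 0.0, 1.0)` would draw from the density proportional to $e^{-x}$ on $[0, 1)$ without ever calling an exponential or logarithm routine.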

Forsythe realized some years later that this approach leads to an efficient 
method for calculating normal deviates, without the need for any auxiliary 
routines to calculate square roots or logarithms as in Algorithms P and M. His 
procedure, with an improved choice of intervals [a, b ) due to J. H. Ahrens and 
U. Dieter, can be summarized as follows. 

Algorithm F (Odd-even method for normal deviates). This algorithm generates
normal deviates on a binary computer, assuming approximately $t + 1$ bits of
accuracy. The algorithm requires a table of values $d_j = a_j - a_{j-1}$ for $1 \le j \le t + 1$,
where $a_j$ is defined by the relation

$$\sqrt{\frac{2}{\pi}} \int_{a_j}^{\infty} e^{-x^2/2}\,dx = \frac{1}{2^j}. \tag{23}$$

F1. [Get U.] Generate a uniform random number $U = (.b_0 b_1 \ldots b_t)_2$, where $b_0$, ..., $b_t$ denote the bits in binary notation. Set $\psi \leftarrow b_0$, $j \leftarrow 1$, and $a \leftarrow 0$.

F2. [Find first zero $b_j$.] If $b_j = 1$, set $a \leftarrow a + d_j$, $j \leftarrow j + 1$, and repeat this step. (If $j = t + 1$, treat $b_j$ as zero.)

F3. [Generate candidate.] (Now $a = a_{j-1}$, and the current value of $j$ occurs with probability $\approx 2^{-j}$. We will generate $X$ in the range $[a_{j-1}, a_j)$, using the rejection method described above, with $h(x) = x^2/2 - a^2/2 = y^2/2 + ay$ where $y = x - a$. Exercise 12 proves that $h(x) \le 1$ as required in (21).) Set $Y \leftarrow d_j$ times $(.b_{j+1} \ldots b_t)_2$ and $V \leftarrow (\frac{1}{2}Y + a)Y$. (Since the average value of $j$ is 2, there will usually be enough significant bits in $(.b_{j+1} \ldots b_t)_2$ to provide decent accuracy. The calculations are readily done in fixed point arithmetic.)

F4. [Reject?] Generate a uniform deviate $U$. If $V < U$, go on to step F5. Otherwise set $V$ to a new uniform deviate; and if now $U < V$ (i.e., if $K$ is even, in the discussion above), go back to F3, otherwise repeat step F4.

F5. [Return X.] Set $X \leftarrow a + Y$. If $\psi = 1$, set $X \leftarrow -X$. ∎

Values of $d_j$ for $1 \le j \le 47$ appear in a paper by Ahrens and Dieter, Math.
Comp. 27 (1973), 927-937; their paper discusses refinements of the algorithm 
that improve its speed at the expense of more tables. Algorithm F is attractive 
since it is almost as fast as Algorithm M and it is easier to implement. The 
average number of uniform deviates per normal deviate is 2.53947; R. P. Brent 
[CACM 17 (1974), 704-705] has shown how to reduce this number to 1.37446 at 
the expense of two subtractions and one division per uniform deviate saved. 

(4) Ratios of uniform deviates. There is yet another good way to generate normal
deviates, discovered by A. J. Kinderman and J. F. Monahan in 1976. Their idea
is to generate a random point $(U, V)$ in the region defined by

$$0 < u \le 1, \qquad -2u\sqrt{\ln(1/u)} \;\le\; v \;\le\; 2u\sqrt{\ln(1/u)}, \tag{24}$$

and then to output the ratio $X \leftarrow V/U$. The shaded area of Fig. 13 is the magic
region (24) that makes this all work. Before we study the associated theory, let
us first state the algorithm so that its efficiency and simplicity are manifest:

Algorithm R (Ratio method for normal deviates). This algorithm generates 
normal deviates X. 

R1. [Get U, V.] Generate two independent uniform deviates $U$ and $V$, where $U$ is nonzero, and set $X \leftarrow \sqrt{8/e}\,(V - \frac{1}{2})/U$. (Now $X$ is the ratio of the coordinates $(U, \sqrt{8/e}\,(V - \frac{1}{2}))$ of a random point in the rectangle that encloses the shaded region in Fig. 13. We will accept $X$ if the corresponding point actually lies "in the shade," otherwise we will try again.)

R2. [Optional upper bound test.] If $X^2 \le 5 - 4e^{1/4}U$, output $X$ and terminate the algorithm. (This step can be omitted if desired; it tests whether or not the selected point is in the interior region of Fig. 13, making it unnecessary to calculate a logarithm.)

R3. [Optional lower bound test.] If $X^2 \ge 4e^{-1.35}/U + 1.4$, go back to R1. (This step could also be omitted; it tests whether or not the selected point is outside the exterior region of Fig. 13, making it unnecessary to calculate a logarithm.)

R4. [Final test.] If $X^2 \le -4\ln U$, output $X$ and terminate the algorithm. Otherwise go back to R1. ∎

Fig. 13. Region of "acceptance" in the ratio-of-uniforms method for normal deviates. Lengths of lines with coordinate ratio $x$ have the normal distribution.


Exercises 20 and 21 work out the timing analysis; four different algorithms 
are analyzed, since steps R2 and R3 can be included or omitted depending on 
one’s preference. The following table shows how many times each step will be 
performed, on the average, depending on which of the optional tests is applied: 


Step    Neither    R2 only    R3 only    Both
R1      1.369      1.369      1.369      1.369
R2      0          1.369      0          1.369          (25)
R3      0          0          1.369      0.467
R4      1.369      0.467      1.134      0.232


Thus it pays to omit the optional tests if there is a very fast logarithm operation, 
but if the log routine is rather slow it pays to include them. 
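Algorithm R translates almost line for line into code. The Python sketch below is illustrative only; the function name `normal_ratio_deviate` and the `use_bounds` flag (which switches both optional tests on or off together) are assumptions made for this example, and $U$ is drawn from $(0, 1]$ so that it is never zero.

```python
import math
import random

SQRT_8_OVER_E = math.sqrt(8.0 / math.e)

def normal_ratio_deviate(rng=random, use_bounds=True):
    """Sketch of Algorithm R (ratio of uniforms) for a standard normal deviate."""
    while True:
        # R1: U nonzero, V uniform; X is the ratio of the point's coordinates.
        u = 1.0 - rng.random()
        v = rng.random()
        x = SQRT_8_OVER_E * (v - 0.5) / u
        xx = x * x
        if use_bounds:
            # R2: quick acceptance inside the interior region.
            if xx <= 5.0 - 4.0 * math.exp(0.25) * u:
                return x
            # R3: quick rejection outside the exterior region.
            if xx >= 4.0 * math.exp(-1.35) / u + 1.4:
                continue
        # R4: exact test, the only place a logarithm is needed.
        if xx <= -4.0 * math.log(u):
            return x
```

Calling the function with `use_bounds=False` corresponds to the "Neither" column of the table, in which every candidate reaches the logarithm in step R4.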

But why does it work? One reason is that we can calculate the probability 
that X < x, and it turns out to be the correct value (10). But such a calculation 
isn’t very easy unless one happens to hit on the right “trick,” and anyway it is 
better to understand how the algorithm might have been discovered in the first 
place. Kinderman and Monahan derived it by working out the following theory 
that can be used with any well-behaved density function f(x) [cf. ACM Trans. 
Math. Software 3 (1977), 257-260]. 

In general, suppose that a point $(U, V)$ has been generated uniformly over
the region of the $(u, v)$-plane defined by

$$u > 0, \qquad u^2 \le g(v/u) \tag{26}$$

for some nonnegative integrable function $g$. If we set $X \leftarrow V/U$, the probability
that $X \le x$ can be calculated by integrating $du\,dv$ over the region defined by the
two relations in (26) plus the auxiliary condition $v/u \le x$, then dividing by the
same integral without this extra condition. Letting $v = tu$ so that $dv = u\,dt$,
the integral becomes

$$\int_{-\infty}^{x} dt \int_0^{\sqrt{g(t)}} u\,du \;=\; \frac{1}{2} \int_{-\infty}^{x} g(t)\,dt.$$

Hence the probability that $X \le x$ is

$$\int_{-\infty}^{x} g(t)\,dt \bigg/ \int_{-\infty}^{+\infty} g(t)\,dt. \tag{27}$$

The normal distribution comes out when $g(t) = e^{-t^2/2}$; and the condition
$u^2 \le g(v/u)$ simplifies in this case to $(v/u)^2 \le -4\ln u$. It is easy to see that
the set of all $(u, v)$ satisfying this relation is entirely contained in the rectangle
of Fig. 13.

The bounds in steps R2 and R3 define interior and exterior regions with
simpler boundary equations. The well-known inequality

$$e^x \ge 1 + x,$$

which holds for all real numbers $x$, can be used to show that

$$1 + \ln c - cu \;\le\; -\ln u \;\le\; 1/(cu) - 1 + \ln c \tag{28}$$

for any constant $c > 0$. Exercise 21 proves that $c = e^{1/4}$ is the best possible
constant to use in step R2. The situation is more complicated in step R3, and
there doesn't seem to be a simple expression for the optimum $c$ in that case, but
computational experiments show that the best value for R3 is approximately
$e^{1.35}$. The approximating curves (28) are tangent to the true boundary when
$u = 1/c$.

It is possible to obtain a faster method by partitioning the region into 
subregions, most of which can be handled more quickly. Of course, this means 
that auxiliary tables will be needed, as in Algorithms M and F. 

(5) Variations of the normal distribution. So far we have considered the normal
distribution with mean zero and standard deviation one. If $X$ has this distribution,
then

$$Y = \mu + \sigma X \tag{29}$$

has the normal distribution with mean $\mu$ and standard deviation $\sigma$. Furthermore,
if $X_1$ and $X_2$ are independent normal deviates with mean zero and standard
deviation one, and if

$$Y_1 = \mu_1 + \sigma_1 X_1, \qquad Y_2 = \mu_2 + \sigma_2\bigl(\rho X_1 + \sqrt{1 - \rho^2}\,X_2\bigr), \tag{30}$$

then $Y_1$ and $Y_2$ are dependent random variables, normally distributed with means
$\mu_1$, $\mu_2$ and standard deviations $\sigma_1$, $\sigma_2$, and with correlation coefficient $\rho$. (For
a generalization to $n$ variables, see exercise 13.)

D. The exponential distribution. After uniform deviates and normal deviates, the
next most important random quantity is an exponential deviate. Such numbers
occur in "arrival time" situations; for example, if a radioactive substance emits
alpha particles at a rate so that one particle is emitted every $\mu$ seconds on the
average, then the time between two successive emissions has the exponential
distribution with mean $\mu$. This distribution is defined by the formula

$$F(x) = 1 - e^{-x/\mu}, \qquad x \ge 0. \tag{31}$$


(1) Logarithm method. Clearly, if $y = F(x) = 1 - e^{-x/\mu}$, then $x = F^{-1}(y) = -\mu\ln(1 - y)$.
Therefore by Eq. (7), $-\mu\ln(1 - U)$ has the exponential distribution.
Since $1 - U$ is uniformly distributed when $U$ is, we conclude that

$$X = -\mu \ln U \tag{32}$$

is exponentially distributed with mean $\mu$. (The case $U = 0$ must be avoided.)
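Formula (32) amounts to a single line of code. In the Python sketch below, the function name `exponential_deviate` and the `rng` parameter are assumptions for illustration; the uniform deviate is taken as `1.0 - rng.random()`, which lies in $(0, 1]$, as one simple way to avoid the case $U = 0$.

```python
import math
import random

def exponential_deviate(mu, rng=random):
    """Sketch of the logarithm method: X = -mu * ln(U) with U in (0, 1]."""
    u = 1.0 - rng.random()      # never zero, so the logarithm is defined
    return -mu * math.log(u)
```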

(2) Random minimization method. We saw in Algorithm F that there are simple 
and fast alternatives to calculating the logarithm of a uniform deviate. The 
following especially efficient approach has been developed by G. Marsaglia, M. 
Sibuya, and J. H. Ahrens. 

Algorithm S (Exponential distribution with mean $\mu$). This algorithm produces
exponential deviates on a binary computer, using uniform deviates with $t$-bit
accuracy. The constants

$$Q[k] = \frac{\ln 2}{1!} + \frac{(\ln 2)^2}{2!} + \cdots + \frac{(\ln 2)^k}{k!}, \qquad k \ge 1, \tag{33}$$

should be precomputed, extending until $Q[k] > 1 - 2^{1-t}$.

S1. [Get U and shift.] Generate a $t$-bit uniform random binary fraction $U = (.b_1 b_2 \ldots b_t)_2$; locate the first zero bit $b_j$, and shift off the leading $j$ bits, setting $U \leftarrow (.b_{j+1} \ldots b_t)_2$. (As in Algorithm F, the average value of $j$ is 2.)

S2. [Immediate acceptance?] If $U < \ln 2$, set $X \leftarrow \mu(j \ln 2 + U)$ and terminate the algorithm. (Note that $Q[1] = \ln 2$.)

S3. [Minimize.] Find the least $k \ge 2$ such that $U < Q[k]$. Generate $k$ new uniform deviates $U_1$, ..., $U_k$ and set $V \leftarrow \min(U_1, \ldots, U_k)$.

S4. [Deliver the answer.] Set $X \leftarrow \mu(j + V)\ln 2$. ∎
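A floating-point sketch of Algorithm S may help in following the steps. Everything named here (`build_q`, `exponential_deviate_s`, the `rng` and `t` parameters) is hypothetical. One indexing choice in particular is an assumption of this sketch and should be checked against the statement of step S1: the code takes $j$ to be the number of leading 1 bits, so $j = 0$ when the first bit is already zero, which lets $j\ln 2$ play the role of the integer multiple of $\ln 2$ below $X/\mu$.

```python
import math
import random

LN2 = math.log(2.0)

def build_q(t):
    """Table with q[k] = sum_{i=1..k} (ln 2)^i / i!, extended until
    q[k] > 1 - 2^(1-t); q[0] = 0 is a placeholder."""
    q, term, k = [0.0], 1.0, 0
    while q[-1] <= 1.0 - 2.0 ** (1 - t):
        k += 1
        term *= LN2 / k
        q.append(q[-1] + term)
    return q

def exponential_deviate_s(mu, q, rng=random, t=32):
    """Sketch of Algorithm S; j counts leading 1 bits (assumed indexing)."""
    bits = rng.getrandbits(t)
    j, mask = 0, 1 << (t - 1)
    while j < t and bits & mask:        # S1: skip the leading 1 bits
        j += 1
        mask >>= 1
    rem = t - j - 1                     # bits left after the first zero bit
    u = (bits & ((1 << rem) - 1)) / (1 << rem) if rem > 0 else 0.0
    if u < LN2:                         # S2: immediate acceptance
        return mu * (j * LN2 + u)
    k = 2                               # S3: least k >= 2 with u < q[k]
    while u >= q[k]:
        k += 1
    v = min(rng.random() for _ in range(k))
    return mu * (j + v) * LN2           # S4
```

The table would be built once, for example `q = build_q(32)`, and then reused for every call.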

Alternative ways to generate exponential deviates (e.g., a ratio of uniforms
as in Algorithm R) might also be used.

E. Other continuous distributions. Let us now consider briefly how to handle
some other distributions that arise reasonably often in practice.

(1) The gamma distribution of order $a > 0$ is defined by

$$F(x) = \frac{1}{\Gamma(a)} \int_0^x t^{a-1} e^{-t}\,dt, \qquad x \ge 0. \tag{34}$$

When $a = 1$, this is the exponential distribution with mean 1; when $a = \frac{1}{2}$,
it is the distribution of $\frac{1}{2}Z^2$, where $Z$ has the normal distribution (mean 0,
variance 1). If $X$ and $Y$ are independent gamma-distributed random variables,
of order $a$ and $b$, respectively, then $X + Y$ has the gamma distribution of order
$a + b$. Thus, for example, the sum of $k$ independent exponential deviates with
mean 1 has the gamma distribution of order $k$. If the logarithm method (32)
is being used to generate these exponential deviates, we need compute only one
logarithm: $X \leftarrow -\ln(U_1 \ldots U_k)$, where $U_1$, ..., $U_k$ are nonzero uniform deviates.
This technique handles all integer orders $a$; to complete the picture, a suitable
method for $0 < a < 1$ appears in exercise 16.
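The one-logarithm trick for integer orders can be sketched as follows. The function name `gamma_integer_order` and the `rng` parameter are assumptions for this illustration; each uniform deviate is taken from $(0, 1]$ so that the product is never zero for moderate $k$.

```python
import math
import random

def gamma_integer_order(k, rng=random):
    """Sketch: gamma deviate of integer order k as -ln(U_1 * ... * U_k)."""
    product = 1.0
    for _ in range(k):
        product *= 1.0 - rng.random()   # nonzero uniform deviate in (0, 1]
    return -math.log(product)
```

For very large $k$ the floating-point product would eventually underflow, which is one more reason to prefer a different method in that range.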

The simple logarithm method is much too slow when $a$ is large, since it
requires $\lfloor a \rfloor$ uniform deviates. For large $a$, the following algorithm due to J. H.
Ahrens is reasonably efficient, and it is easy to write in terms of standard
subroutines.

Algorithm A (Gamma distribution of order $a > 1$).

A1. [Generate candidate.] Set $Y \leftarrow \tan(\pi U)$, where $U$ is a uniform deviate, and set $X \leftarrow \sqrt{2a - 1}\,Y + a - 1$. (In place of $\tan(\pi U)$ we could use a polar method, e.g., $V_2/V_1$ in step P4 of Algorithm P.)

A2. [Accept?] If $X \le 0$, return to A1. Otherwise generate a uniform deviate $V$, and return to A1 if $V > (1 + Y^2)\exp\bigl((a - 1)\ln(X/(a - 1)) - \sqrt{2a - 1}\,Y\bigr)$. Otherwise accept $X$. ∎

The average number of times step A1 is performed is $< 1.902$ when $a \ge 3$.
For further discussion, proof, and a slightly more complex method that is two to
three times faster, see J. H. Ahrens and U. Dieter, Computing 12 (1974), 223-246.
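Algorithm A needs only a tangent, a logarithm, and an exponential, so it is easy to try out. The Python below is a sketch under assumptions: the function name `gamma_deviate_a` and the `rng` parameter are hypothetical, and the code uses $\tan(\pi U)$ directly rather than the polar alternative mentioned in step A1.

```python
import math
import random

def gamma_deviate_a(a, rng=random):
    """Sketch of Algorithm A for a gamma deviate of order a > 1."""
    if a <= 1.0:
        raise ValueError("this sketch assumes a > 1")
    root = math.sqrt(2.0 * a - 1.0)
    while True:
        # A1: candidate from a scaled and shifted tangent of a uniform angle.
        y = math.tan(math.pi * rng.random())
        x = root * y + a - 1.0
        # A2: discard nonpositive candidates, then apply the rejection test.
        if x <= 0.0:
            continue
        v = rng.random()
        bound = (1.0 + y * y) * math.exp((a - 1.0) * math.log(x / (a - 1.0)) - root * y)
        if v <= bound:
            return x
```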

There is also an attractive approach for large $a$ based on the remarkable fact
that gamma deviates are approximately equal to $aX^3$, where $X$ is normally distributed
with mean $1 - 1/(9a)$ and standard deviation $1/\sqrt{9a}$; see G. Marsaglia,
Computers and Math. 3 (1977), 321-325.*

*Change "$+(3a - 1)$" to "$-(3a - 1)$" in Step 3 of the algorithm on page 323.

(2) The beta distribution with positive parameters $a$ and $b$ is defined by

$$F(x) = \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)} \int_0^x t^{a-1}(1 - t)^{b-1}\,dt, \qquad 0 \le x \le 1. \tag{35}$$

Let $X_1$ and $X_2$ be independent gamma deviates of order $a$ and $b$, respectively,
and set $X \leftarrow X_1/(X_1 + X_2)$. Another method, useful for small $a$ and $b$, is to set
$Y_1 \leftarrow U_1^{1/a}$ and $Y_2 \leftarrow U_2^{1/b}$ repeatedly until $Y_1 + Y_2 \le 1$; then $X \leftarrow Y_1/(Y_1 + Y_2)$.
[See M. D. Jöhnk, Metrika 8 (1964), 5-15.] Still another approach, if $a$ and $b$ are
integers (not too large), is to set $X$ to the $b$th largest of $a + b - 1$ independent
uniform deviates (cf. exercise 7 at the beginning of Chapter 5). See also the
direct method described by R. C. H. Cheng, CACM 21 (1978), 317-322.
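The second method in the paragraph above, which raises two uniform deviates to the powers $1/a$ and $1/b$, can be sketched in a few lines. The function name `beta_deviate_small` and the `rng` parameter are assumptions; the extra check that the sum is positive is a defensive addition of this sketch, guarding against a division by zero if both powers underflow.

```python
import random

def beta_deviate_small(a, b, rng=random):
    """Sketch of the power-of-uniforms method for a beta deviate,
    intended for small positive parameters a and b."""
    while True:
        y1 = rng.random() ** (1.0 / a)
        y2 = rng.random() ** (1.0 / b)
        total = y1 + y2
        if 0.0 < total <= 1.0:          # repeat until Y1 + Y2 <= 1
            return y1 / total
```

The loop is repeated more and more often as $a$ and $b$ grow, which is why the text recommends this approach only for small parameters.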

(3) The chi-square distribution with $\nu$ degrees of freedom (Eq. 3.3.1-22) is
obtained by setting $X \leftarrow 2Y$, where $Y$ is a random variable having the gamma
distribution of order $\nu/2$.

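(4) The F-distribution (variance-ratio distribution) with $\nu_1$ and $\nu_2$ degrees of
freedom is defined by

$$F(x) = \frac{\nu_1^{\nu_1/2}\,\nu_2^{\nu_2/2}\,\Gamma\bigl((\nu_1 + \nu_2)/2\bigr)}{\Gamma(\nu_1/2)\,\Gamma(\nu_2/2)} \int_0^x t^{\nu_1/2 - 1} (\nu_2 + \nu_1 t)^{-\nu_1/2 - \nu_2/2}\,dt, \tag{36}$$

where $x \ge 0$. Let $Y_1$ and $Y_2$ be independent, having the chi-square distribution
with $\nu_1$ and $\nu_2$ degrees of freedom, respectively; set $X \leftarrow Y_1\nu_2/Y_2\nu_1$. Or set
$X \leftarrow \nu_2 Y/\nu_1(1 - Y)$, where $Y$ is a beta variate with parameters $\nu_1/2$, $\nu_2/2$.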


(5) The t-distribution with $\nu$ degrees of freedom is defined by

$$F(x) = \frac{\Gamma\bigl((\nu + 1)/2\bigr)}{\sqrt{\pi\nu}\;\Gamma(\nu/2)} \int_{-\infty}^{x} \bigl(1 + t^2/\nu\bigr)^{-(\nu+1)/2}\,dt. \tag{37}$$

Let $Y_1$ be a normal deviate (mean 0, variance 1) and let $Y_2$ be independent
of $Y_1$, having the chi-square distribution with $\nu$ degrees of freedom; set
$X \leftarrow Y_1/\sqrt{Y_2/\nu}$. Alternatively, when $\nu > 2$, let $Y_1$ be a normal deviate and let
$Y_2$ independently have the exponential distribution with mean $2/(\nu - 2)$; set
$Z \leftarrow Y_1^2/(\nu - 2)$ and reject $(Y_1, Y_2)$ if $e^{-Y_2 - Z} \ge 1 - Z$, otherwise set
$X \leftarrow Y_1/\sqrt{(1 - 2/\nu)(1 - Z)}$. The latter method is due to George Marsaglia, Math.
Comp. 34 (1980), 235-236. [See also A. J. Kinderman, J. F. Monahan, and J. G.
Ramage, Math. Comp. 31 (1977), 1009-1018.]

(6) Random point on n-dimensional sphere with radius one. Let $X_1$, $X_2$, ...,
$X_n$ be independent normal deviates (mean 0, variance 1); the desired point on
the unit sphere is

$$(X_1/r,\; X_2/r,\; \ldots,\; X_n/r), \qquad \text{where } r = \sqrt{X_1^2 + X_2^2 + \cdots + X_n^2}. \tag{38}$$

Note that if the $X$'s are calculated using the polar method, Algorithm P, we
compute two independent $X$'s each time, and $X_1^2 + X_2^2 = -2\ln S$ (in the
notation of that algorithm); this saves a little of the time needed to evaluate $r$.
The validity of this method comes from the fact that the distribution function for
the point $(X_1, \ldots, X_n)$ has a density that depends only on its distance from the
origin, so when it is projected onto the unit sphere it has the uniform distribution.
This method was first suggested by G. W. Brown, in Modern Mathematics for
the Engineer, First series, ed. by E. F. Beckenbach (New York: McGraw-Hill,
1956), p. 302. To get a random point inside the $n$-sphere, R. P. Brent suggests
taking a point on the surface and multiplying it by $U^{1/n}$.

In three dimensions a significantly simpler method can be used, since each
individual coordinate is uniformly distributed between $-1$ and 1: Find $V_1$, $V_2$,
and $S$ by steps P1-P3 of Algorithm P; then the desired random point on the
surface of a globe is $(\alpha V_1, \alpha V_2, 2S - 1)$, where $\alpha = 2\sqrt{1 - S}$. [Robert E. Knop,
CACM 13 (1970), 326.]
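Both constructions can be sketched briefly in Python. The function names and the `rng` parameter are hypothetical. Two assumptions are made: normal deviates are taken from the standard library's `gauss` rather than from Algorithm P, and steps P1-P3 of Algorithm P (which are not reproduced in this section) are assumed to draw $V_1$, $V_2$ uniformly between $-1$ and 1 and to retry until $S = V_1^2 + V_2^2 < 1$.

```python
import math
import random

def random_unit_vector(n, rng=random):
    """Sketch of (38): normalize n independent standard normal deviates."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    r = math.sqrt(sum(x * x for x in xs))
    return [x / r for x in xs]

def random_point_on_globe(rng=random):
    """Sketch of the three-dimensional shortcut using V1, V2, and S."""
    while True:
        # Assumed form of steps P1-P3: a uniform point inside the unit circle.
        v1 = 2.0 * rng.random() - 1.0
        v2 = 2.0 * rng.random() - 1.0
        s = v1 * v1 + v2 * v2
        if s < 1.0:
            break
    alpha = 2.0 * math.sqrt(1.0 - s)
    return (alpha * v1, alpha * v2, 2.0 * s - 1.0)
```

One can check directly that the second point has unit length, since $\alpha^2 S + (2S - 1)^2 = 4S(1 - S) + (2S - 1)^2 = 1$.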

F. Important integer-valued distributions. A probability distribution that consists
only of integer values can essentially be handled by the techniques described
at the beginning of this section; but some of these distributions are so important
in practice, they deserve special mention here.

(1) The geometric distribution. If some event occurs with probability $p$, the
number $N$ of independent trials needed until the event first occurs (or between
occurrences of the event) has the geometric distribution. We have $N = 1$ with
probability $p$, $N = 2$ with probability $(1 - p)p$, ..., $N = n$ with probability
$(1 - p)^{n-1}p$. This is essentially the situation we have already considered in the
gap test of Section 3.3.2; it is also directly related to the number of times certain
loops in the algorithms of this section are executed, e.g., steps P1-P3 of the
polar method.

A convenient way to generate a variable with this distribution is to set

$$N \leftarrow \lceil \ln U / \ln(1 - p) \rceil. \tag{39}$$

To check this formula, we observe that $\lceil \ln U/\ln(1 - p) \rceil = n$ if and only if
$n - 1 < \ln U/\ln(1 - p) \le n$, that is, $(1 - p)^{n-1} > U \ge (1 - p)^n$, and
this happens with probability $p(1 - p)^{n-1}$ as required. Note that $\ln U$ can be
replaced by $-Y$, where $Y$ has the exponential distribution with mean 1.

The special case $p = \frac{1}{2}$ can be handled more easily on a binary computer,
since formula (39) becomes $N \leftarrow \lceil -\log_2 U \rceil$; that is, $N$ is one more than the
number of leading zero bits in the binary representation of $U$.
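Formula (39) and its binary special case can be sketched as follows. The function names, the `rng` parameter, and the word size `t` are assumptions for illustration; the `max(1, ...)` guard covers the boundary draw $U = 1$, for which the formula would otherwise give 0.

```python
import math
import random

def geometric_deviate(p, rng=random):
    """Sketch of formula (39) for 0 < p < 1."""
    u = 1.0 - rng.random()                      # uniform deviate in (0, 1]
    return max(1, math.ceil(math.log(u) / math.log1p(-p)))

def geometric_deviate_half(rng=random, t=32):
    """Sketch of the case p = 1/2: one more than the number of leading
    zero bits in a t-bit uniform random word."""
    bits = rng.getrandbits(t)
    return t - bits.bit_length() + 1
```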

(2) The binomial distribution $(t, p)$. If some event occurs with probability $p$, and
if we carry out $t$ independent trials, the total number $N$ of occurrences equals $n$
with probability $\binom{t}{n} p^n (1 - p)^{t-n}$. (See Section 1.2.10.) In other words if we
generate $U_1$, ..., $U_t$, we want to count how many of these are $< p$. For small $t$
we can obtain $N$ in exactly this way.

For large $t$, we can generate a beta variate $X$ with integer parameters $a$
and $b$ where $a + b - 1 = t$; this effectively gives us the $b$th largest of $t$ elements,
without bothering to generate the other elements. Now if $X \ge p$, we set $N \leftarrow N_1$
where $N_1$ has the binomial distribution $(a - 1, p/X)$, since this tells us how
many of $a - 1$ random numbers in the range $[0, X)$ are $< p$; and if $X < p$, we set
$N \leftarrow a + N_1$ where $N_1$ has the binomial distribution $(b - 1, (p - X)/(1 - X))$,
since $N_1$ tells us how many of $b - 1$ random numbers in the range $[X, 1)$ are $< p$.
By choosing $a = 1 + \lfloor t/2 \rfloor$, the parameter $t$ will be reduced to a reasonable size
after about $\lg t$ reductions of this kind. (This approach is due to J. H. Ahrens,
who has also suggested an alternative for medium-sized $t$; see exercise 27.)
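The reduction for large $t$ can be written as a short loop that halves $t$ until direct counting becomes cheap. The sketch below rests on several assumptions: the function name `binomial_deviate`, the `cutoff` at which counting takes over, and the use of the standard library's `betavariate` as the source of beta variates are all choices made for this illustration only.

```python
import random

def binomial_deviate(t, p, rng=random, cutoff=16):
    """Sketch of the beta-variate reduction for a binomial (t, p) deviate."""
    if p <= 0.0:
        return 0
    if p >= 1.0:
        return t
    n = 0
    while t > cutoff:
        a = 1 + t // 2
        b = t + 1 - a                    # so that a + b - 1 = t
        x = rng.betavariate(a, b)        # b-th largest of t uniform deviates
        if x >= p:
            # Only the a-1 numbers below x can be < p.
            t, p = a - 1, p / x
        else:
            # The a smallest numbers are all < p; continue with those above x.
            n += a
            t, p = b - 1, (p - x) / (1.0 - x)
    # Small t: count directly how many uniform deviates fall below p.
    return n + sum(1 for _ in range(t) if rng.random() < p)
```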
(3) The Poisson distribution with mean $\mu$. This distribution is related to the
exponential distribution as the binomial distribution is related to the geometric: 
It represents the number of occurrences, per unit time, of an event that can occur 
at any instant of time. For example, the number of alpha particles emitted by 
a radioactive substance in a single second has a Poisson distribution. 

According to this principle, we can produce a Poisson deviate $N$ by generating
independent exponential deviates $X_1$, $X_2$, ... with mean $1/\mu$, stopping as
soon as $X_1 + \cdots + X_m \ge 1$; then $N \leftarrow m - 1$. The probability that
$X_1 + \cdots + X_m \ge 1$ is the probability that a gamma deviate of order $m$ is $\ge \mu$, and
this comes to $\int_\mu^\infty t^{m-1} e^{-t}\,dt/(m - 1)!$; hence the probability that $N = n$ is

$$\frac{1}{n!} \int_\mu^\infty t^n e^{-t}\,dt \;-\; \frac{1}{(n-1)!} \int_\mu^\infty t^{n-1} e^{-t}\,dt \;=\; e^{-\mu}\,\frac{\mu^n}{n!}, \qquad n \ge 0.$$

If we generate exponential deviates by the logarithm method, the above recipe
tells us to stop when $-(\ln U_1 + \cdots + \ln U_m)/\mu \ge 1$. Simplifying this expression,
we see that the desired Poisson deviate can be obtained by calculating $e^{-\mu}$,
converting it to a fixed point representation, then generating one or more uniform
deviates $U_1$, $U_2$, ... until the product satisfies $U_1 \ldots U_m \le e^{-\mu}$, finally setting
$N \leftarrow m - 1$. On the average this requires the generation of $\mu + 1$ uniform
deviates, so it is a very useful approach when $\mu$ is not too large.
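The product method just described is only a few lines of code. In this Python sketch the function name `poisson_deviate` and the `rng` parameter are assumptions, and ordinary floating point is used in place of the fixed point representation mentioned above.

```python
import math
import random

def poisson_deviate(mu, rng=random):
    """Sketch: multiply uniform deviates until the product is <= exp(-mu)."""
    limit = math.exp(-mu)
    product = 1.0
    m = 0
    while True:
        product *= rng.random()
        m += 1
        if product <= limit:
            return m - 1
```

Because $e^{-\mu}$ underflows in floating point once $\mu$ exceeds several hundred, the sketch is meaningful only for moderate $\mu$, which is also the range in which the method is recommended.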

When $\mu$ is large, we can obtain a method of order $\log \mu$ by using the fact that
we know how to handle the gamma and binomial distributions for large orders:
First generate $X$ with the gamma distribution of order $m = \lfloor \alpha\mu \rfloor$, where $\alpha$ is a
suitable constant. (Since $X$ is equivalent to $-\ln(U_1 \ldots U_m)$, we are essentially
bypassing $m$ steps of the previous method.) If $X < \mu$, set $N \leftarrow m + N_1$, where
$N_1$ is a Poisson deviate with mean $\mu - X$; and if $X \ge \mu$, set $N \leftarrow N_1$, where
$N_1$ has the binomial distribution $(m - 1, \mu/X)$. This method is due to J. H.
Ahrens and U. Dieter, whose experiments suggest that $\frac{7}{8}$ is a good choice for $\alpha$.

The validity of the above reduction when $X \ge \mu$ is a consequence of the
following important principle: "Let $X_1$, ..., $X_m$ be independent exponential
deviates with the same mean; let $S_j = X_1 + \cdots + X_j$ and let $V_j = S_j/S_m$
for $1 \le j \le m$. Then the distribution of $V_1$, $V_2$, ..., $V_{m-1}$ is the same as
the distribution of $m - 1$ independent uniform deviates sorted into increasing
order." To establish this principle formally, we compute the probability that
$V_1 \le v_1$, ..., $V_{m-1} \le v_{m-1}$, given the value of $S_m = s$, for arbitrary values
$0 \le v_1 \le \cdots \le v_{m-1} \le 1$: Let $f(v_1, v_2, \ldots, v_{m-1})$ be the $(m - 1)$-fold integral

$$\int_0^{v_1 s} \mu e^{-t_1/\mu}\,dt_1 \int_0^{v_2 s - t_1} \mu e^{-t_2/\mu}\,dt_2 \;\cdots \int_0^{v_{m-1}s - t_1 - \cdots - t_{m-2}} \mu e^{-t_{m-1}/\mu}\,dt_{m-1} \cdot \mu e^{-(s - t_1 - \cdots - t_{m-1})/\mu};$$

then

$$\frac{f(v_1, \ldots, v_{m-1})}{f(1, 1, \ldots, 1)} = \frac{\int_0^{v_1} du_1 \int_{u_1}^{v_2} du_2 \cdots \int_{u_{m-2}}^{v_{m-1}} du_{m-1}}{\int_0^{1} du_1 \int_{u_1}^{1} du_2 \cdots \int_{u_{m-2}}^{1} du_{m-1}},$$

by making the substitution $t_1 = su_1$, $t_1 + t_2 = su_2$, ..., $t_1 + \cdots + t_{m-1} = su_{m-1}$.
The latter ratio is the corresponding probability that uniform deviates
$U_1$, ..., $U_{m-1}$ satisfy $U_1 \le v_1$, ..., $U_{m-1} \le v_{m-1}$, given that they also satisfy
$U_1 \le \cdots \le U_{m-1}$.

A more efficient but somewhat more complicated technique for binomial and 
Poisson deviates is sketched in exercise 22. 

G. For further reading. The forthcoming book Non-Uniform Random Numbers
by J. H. Ahrens and U. Dieter discusses many more algorithms for the generation
of random variables with nonuniform distributions, together with a careful
consideration of the efficiency of each technique on typical computers.

From a theoretical point of view it is interesting to consider optimal methods 
for generating random variables with a given distribution, in the sense that 
the method produces the desired result from the minimum possible number of 
random bits. For the beginnings of a theory dealing with such questions, see 
D. E. Knuth and A. C. Yao, Algorithms and Complexity, ed. by J. F. Traub 
(New York: Academic Press, 1976), 357-428. 

Exercise 16 is recommended as a review of many of the techniques in this 
section. 


EXERCISES 

1. [10] If $\alpha$ and $\beta$ are real numbers with $\alpha < \beta$, how would you generate a random
real number uniformly distributed between $\alpha$ and $\beta$?

2. [M16] Assuming that $mU$ is a random integer between 0 and $m - 1$, what is
the exact probability that $\lfloor kU \rfloor = r$, if $0 \le r < k$? Compare this with the desired
probability $1/k$.

► 3. [14] Discuss treating U as an integer and dividing by k to get a random integer 
between 0 and k — 1, instead of multiplying as suggested in the text. Thus (1) would 
be changed to 

ENTA 0 
LDX U 
DIV K 

with the result appearing in register X. Is this a good method? 

4. [M20] Prove the two relations in (8). 

► 5. [21] Suggest an efficient method to compute a random variable with the distribution
$px + qx^2 + rx^3$, where $p \ge 0$, $q \ge 0$, $r \ge 0$, and $p + q + r = 1$.

6. [HM21] A quantity $X$ is computed by the following method:

"Step 1. Generate two independent uniform deviates $U$, $V$.

Step 2. If $U^2 + V^2 \ge 1$, return to step 1; otherwise set $X \leftarrow U$."

What is the distribution function of $X$? How many times will step 1 be performed?
(Give the mean and standard deviation.)

► 7. [20] (A. J. Walker.) Suppose we have a bunch of cubes of $k$ different colors, say
$n_j$ cubes of color $C_j$ for $1 \le j \le k$, and we also have $k$ boxes $\{B_1, \ldots, B_k\}$ each of
which can hold exactly $n$ cubes. Furthermore $n_1 + \cdots + n_k = kn$, so the cubes will
just fit in the boxes. Prove (constructively) that there is always a way to put the cubes
into the boxes so that each box contains at most two different colors of cubes; in fact,
there is a way to do it so that, whenever box $B_j$ contains two colors, one of those colors
is $C_j$. Show how to use this principle to compute the $P$ and $Y$ tables required in (3),
given a probability distribution $(p_1, \ldots, p_k)$.

8. [M15] Show that operation (3) could be changed to

if $U < P_K$ then $X \leftarrow x_{K+1}$ otherwise $X \leftarrow Y_K$

(i.e., using the original value of $U$ instead of $V$) if this were more convenient, by suitably
modifying $P_0$, $P_1$, ..., $P_{k-1}$.

9. [HM10] Why is the curve $f(x)$ of Fig. 9 concave downward for $x < 1$, concave
upward for $x > 1$?

► 10. [HM24] Explain how to calculate auxiliary constants $P_j$, $Q_j$, $Y_j$, $Z_j$, $S_j$, $D_j$, $E_j$
so that Algorithm M delivers answers with the correct distribution.

► 11. [HM27] Prove that steps M7-M8 of Algorithm M generate a random variable with
the appropriate tail of the normal distribution; i.e., the probability that $X \le x$ should
be

$$\int_3^x e^{-t^2/2}\,dt \bigg/ \int_3^\infty e^{-t^2/2}\,dt, \qquad x \ge 3.$$

[Hint: Show that it is a special case of the rejection method, with $g(t) = Cte^{-t^2/2}$ for
some $C$.]

12. [HM2S] (R. P. Brent.) Prove that the numbers $a_j$ defined in (23) satisfy the
relation $a_j^2/2 - a_{j-1}^2/2 < \ln 2$ for all $j \ge 1$. [Hint: If $f(x) = e^{x^2/2} \int_x^\infty e^{-t^2/2}\,dt$, show
that $f(x) > f(y)$ for $0 \le x < y$.]

13. [HM25] Given a set of $n$ independent normal deviates, $X_1$, $X_2$, ..., $X_n$, with
mean 0 and variance 1, show how to find constants $b_j$ and $a_{ij}$, $1 \le j \le i \le n$, so that
if

$$Y_1 = b_1 + a_{11}X_1, \qquad Y_2 = b_2 + a_{21}X_1 + a_{22}X_2, \qquad \ldots,$$

$$Y_n = b_n + a_{n1}X_1 + a_{n2}X_2 + \cdots + a_{nn}X_n,$$

then $Y_1$, $Y_2$, ..., $Y_n$ are dependent normally distributed variables, $Y_j$ has mean $\mu_j$, and
the $Y$'s have a given covariance matrix $(c_{ij})$. (The covariance, $c_{ij}$, of $Y_i$ and $Y_j$ is defined
to be the average value of $(Y_i - \mu_i)(Y_j - \mu_j)$. In particular, $c_{jj}$ is the variance of $Y_j$,
the square of its standard deviation. Not all matrices $(c_{ij})$ can be covariance matrices,
and your construction is, of course, only supposed to work whenever a solution to the
given conditions is possible.)

14. [M21] If $X$ is a random variable with continuous distribution $F(x)$, and if $c$ is a
constant, what is the distribution of $cX$?

15. [HM21] If $X_1$ and $X_2$ are independent random variables with the respective
distributions $F_1(x)$ and $F_2(x)$, and with densities $f_1(x) = F_1'(x)$, $f_2(x) = F_2'(x)$, what
are the distribution and density functions of the quantity $X_1 + X_2$?

► 16. [HM22] (J. H. Ahrens.) Develop an algorithm for gamma deviates of order $a$
when $0 < a < 1$, using the rejection method with $cg(t) = t^{a-1}/\Gamma(a)$ for $0 < t < 1$,
$e^{-t}/\Gamma(a)$ for $t \ge 1$.

► 17. [M24] What is the distribution function F(x) for the geometric distribution with 
probability p? What is the generating function G(z)? What are the mean and standard
deviation of this distribution? 

18. [M24] Suggest a method to compute a random integer N for which N takes the
value n with probability $np^2(1-p)^{n-1}$, n ≥ 0. (The case of particular interest is
when p is rather small.)

19. [22] The negative binomial distribution (t, p) has integer values N = n with
probability $\binom{t-1+n}{n} p^t (1-p)^n$. (Unlike the ordinary binomial distribution, t need not
be an integer, since this quantity is nonnegative for all n whenever t > 0.) Generalizing 
exercise 18, explain how to generate integers N with this distribution when t is a small 
positive integer. What method would you suggest if t = p = 

20. [M20] Let A be the area of the shaded region in Fig. 13, and let R be the area of
the enclosing rectangle. Let I be the area of the interior region recognized by step R2,
and let E be the area between the exterior region rejected in step R3 and the outer
rectangle. Determine the number of times each step of Algorithm R is performed, for
each of its four variants as in (25), in terms of A, R, I, and E.

21. [HM29] Derive formulas for the quantities A, R, I, and E defined in exercise 20.
(For I and especially E you may wish to use an interactive computer algebra system.)
Show that $c = e^{1/4}$ is the best possible constant in step R2 for tests of the form
“$X^2 \le 4(1 + \ln c) - 4cU$.”

22. [HM40] Can the exact Poisson distribution for large μ be obtained by generating
an appropriate normal deviate, converting it to an integer in some convenient way, and 
applying a (possibly complicated) correction a small percent of the time? 

23. [HM23] (J. von Neumann.) Are the following two ways to generate a random
quantity X equivalent (i.e., does the quantity X have the same distribution)?

Method 1: Set X ← sin((π/2)U), where U is uniform.

Method 2: Generate two uniform deviates, U, V, and if $U^2 + V^2 \ge 1$, repeat
until $U^2 + V^2 < 1$. Then set $X \gets |U^2 - V^2|/(U^2 + V^2)$.

24. [HM40] (S. Ulam, J. von Neumann.) Let $V_0$ be a randomly selected real number
between 0 and 1, and define the sequence $\langle V_n \rangle$ by the rule $V_{n+1} = 4V_n(1 - V_n)$. If
this computation is done with perfect accuracy, the result should be a sequence with
the distribution $\sin^2 \pi U$, where U is uniform, i.e., with distribution function
$F(x) = \int_0^x dx/\bigl(\pi\sqrt{x(1-x)}\bigr)$. For if we write $V_n = \sin^2 \pi U_n$, we find that $U_{n+1} = (2U_n) \bmod 1$;
and by the fact that almost all real numbers have a random binary expansion (see
Section 3.5), this sequence $U_n$ is equidistributed. But if the computation of $V_n$ is done
with only finite accuracy, the above argument breaks down because we soon are dealing
with noise from the roundoff error. [Reference: von Neumann’s Collected Works 5,
768-770.]

Analyze the sequence $\langle V_n \rangle$ defined above when only finite accuracy is present, both
empirically (for various different choices of $V_0$) and theoretically. Does the sequence
have a distribution resembling the expected distribution?

25. [M25] Let $X_1$, $X_2$, ..., $X_5$ be binary words each of whose bits is independently
0 or 1 with probability ½. What is the probability that a given bit position of
$X_1 \lor (X_2 \land (X_3 \lor (X_4 \land X_5)))$ contains a 1? Generalize.

26. [M18] Let $N_1$ and $N_2$ be independent Poisson deviates with respective means
$\mu_1$ and $\mu_2$, where $\mu_1 > \mu_2 > 0$. Prove or disprove: (a) $N_1 + N_2$ has the Poisson
distribution with mean $\mu_1 + \mu_2$. (b) $N_1 - N_2$ has the Poisson distribution with mean
$\mu_1 - \mu_2$.

27. [22] (J. H. Ahrens.) On most binary computers there is an efficient way to count
the number of 1’s in a binary word (cf. Section 7.1). Hence there is a nice way to obtain
the binomial distribution (t, p) when p = ½, simply by generating t random bits and
counting the number of 1’s.

Design an algorithm that produces the binomial distribution (t, p) for arbitrary p,
using only a subroutine for the special case p = ½ as a source of random data. [Hint:
Simulate a process that first looks at the most significant bits of t uniform deviates,
then at the second bit of those deviates whose leading bit is not sufficient to determine
whether or not their value is < p, etc.]

28. [HM35] (R. P. Brent.) Develop a method to generate a random point on the
surface of the ellipsoid defined by $\sum a_k x_k^2 = 1$, where $a_1 \ge \cdots \ge a_n > 0$.

29. [M20] (J. L. Bentley and J. D. Saxe.) Find a simple way to generate n numbers
$X_1$, ..., $X_n$ that are uniform between 0 and 1 except for the fact that they are sorted:
$X_1 \le \cdots \le X_n$. Your algorithm should take only O(n) steps.


3.4.2. Random Sampling and Shuffling 

Many data processing applications call for an unbiased choice of n records at 
random from a file containing N records. This problem arises, for example, in 
quality control or other statistical calculations where sampling is needed. Usually 
N is very large, so that it is impossible to contain all the data in memory at once; 
and the individual records themselves are often very large, so that we can’t even 
hold n records in memory. Therefore we seek an efficient procedure for selecting 
n records by deciding either to accept or to reject each record as it comes along, 
writing the accepted records onto an output file. 

Several methods have been devised for this problem. The most obvious 
approach is to select each record with probability n/N ; this may sometimes 
be appropriate, but it gives only an average of n records in the sample. The 
standard deviation is $\sqrt{n(1 - n/N)}$, and it is possible that the sample will be
either too large for the desired application, or too small to give the necessary 
results. 
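The standard deviation quoted above can be checked with a short derivation. The sketch below assumes that each of the N records is selected independently with probability n/N, so that the sample size is binomially distributed; the symbol K for the sample size is introduced here only for illustration.

```latex
\begin{aligned}
\operatorname{E} K &= N \cdot \frac{n}{N} = n, \\
\operatorname{Var} K &= N \cdot \frac{n}{N}\Bigl(1 - \frac{n}{N}\Bigr) = n\Bigl(1 - \frac{n}{N}\Bigr), \\
\sigma_K &= \sqrt{n\,(1 - n/N)}.
\end{aligned}
```

For instance, n = 100 and N = 10000 give a standard deviation of about 9.95, so a sample of 90 or of 110 records would not be unusual.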

A simple modification of the “obvious” procedure gives what we want: The
(t + 1)st record should be selected with probability (n − m)/(N − t), if m items
have already been selected. This is the appropriate probability, since of all the
possible ways to choose n things from N such that m values occur in the first t,
exactly

$$\binom{N-t-1}{n-m-1} \bigg/ \binom{N-t}{n-m} = \frac{n-m}{N-t} \tag{1}$$

of these select the (t + 1)st element.

The idea developed in the preceding paragraph leads immediately to the 
following algorithm: 

Algorithm S (Selection sampling technique). To select n records at random from
a set of N, where 0 < n ≤ N.

S1. [Initialize.] Set t ← 0, m ← 0. (During this algorithm, m represents the
number of records selected so far, and t is the total number of input records
we have dealt with.)

S2. [Generate U.] Generate a random number U, uniformly distributed between
zero and one.

S3. [Test.] If (N − t)U ≥ n − m, go to step S5.

S4. [Select.] Select the next record for the sample, and increase m and t by 1. If
m < n, go to step S2; otherwise the sample is complete and the algorithm
terminates.

S5. [Skip.] Skip the next record (do not include it in the sample), increase t
by 1, and go to step S2. ▮
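The steps of Algorithm S can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `selection_sample`, the generator interface, and the optional `rng` argument are choices made here, and the input is assumed to supply exactly N records.

```python
import random


def selection_sample(records, n, N, rng=random):
    """Yield n records chosen at random from an iterable assumed to hold exactly N records."""
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= N")
    t = 0  # S1: total number of input records dealt with so far
    m = 0  # S1: number of records selected so far
    for record in records:
        u = rng.random()              # S2: uniform deviate, 0 <= u < 1
        if (N - t) * u >= n - m:      # S3: test
            t += 1                    # S5: skip this record
        else:
            yield record              # S4: select this record
            m += 1
            t += 1
            if m >= n:                # sample complete
                return


# Illustrative use: 5 of the records numbered 1 through 20.
sample = list(selection_sample(range(1, 21), n=5, N=20, rng=random.Random(1)))
```

The selected records come out in their original file order, since each record is either accepted or rejected as it is read.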

This algorithm may appear to be unreliable at first glance and, in fact, to 
be incorrect; but a careful analysis (see the exercises below) shows that it is 
completely trustworthy. It is not difficult to verify that 

a) At most N records are input (we never run off the end of the file before 
choosing n items). 

b) The sample is completely unbiased; in particular, the probability that any 
given element is selected, e.g., the last element of the file, is n/N. 

Statement (b) is true in spite of the fact that we are not selecting the (t + 1)st
item with probability n/N, we select it with the probability in Eq. (1)! This has 
caused some confusion in the published literature. Can the reader explain this 
seeming contradiction? 

(Note: When using Algorithm S, one should be careful to use a different 
source of random numbers U each time the program is run, to avoid connections 
between the samples obtained on different days. This can be done, for example, 
by choosing a different value of $X_0$ for the linear congruential method each time;
$X_0$ could be set to the current date, or to the last X value generated on the
previous run of the program.) 

We will usually not have to pass over all N records; in fact, since (b) above 
says that the last record is selected with probability n/N, we will terminate the 
algorithm before considering the last record exactly (1 — n/N) of the time. The 
average number of records considered when n = 2 is about ⅔N, and the general
formulas are given in exercises 5 and 6. 

Algorithm S and a number of other sampling techniques are discussed in a 
paper by C. T. Fan, Mervin E. Muller, and Ivan Rezucha, J. Amer. Stat. Assoc. 
57 (1962), 387-402. The method was independently discovered by T. G. Jones, 
CACM 5 (1962), 343.

A problem arises if we don’t know the value of N in advance, since the
precise value of N is crucial in Algorithm S. Suppose we want to select n items 
at random from a file, without knowing exactly how many are present in that 
file. We could first go through and count the records, then take a second pass 
to select them; but it is generally better to sample m ≥ n of the original items
on the first pass, where m is much less than N, so that only m items must be 
considered on the second pass. The trick, of course, is to do this in such a way 
that the final result is a truly random sample of the original file. 

Since we don’t know when the input is going to end, we must keep track of 
a random sample of the input records seen so far, thus always being prepared 
for the end. As we read the input we will construct a “reservoir” that contains 
only those m records that have appeared among the previous samples. The first 
n records always go into the reservoir. When the (t + 1)st record is being input,
for t ≥ n, we will have in memory a table of n indices pointing to those records
in the reservoir that belong to the random sample we have chosen from the first 
t records. The problem is to maintain this situation with t increased by one, 
namely to find a new random sample from among the t + 1 records now known 
to be present. It is not hard to see that we should include the new record in the 
new sample with probability n/(t + 1), and in such a case it should replace a 
random element of the previous sample. 

Thus, the following procedure does the job: 

Algorithm R (Reservoir sampling). To select n records at random from a file of
unknown size ≥ n, given n > 0. An auxiliary file called the “reservoir” contains
all records that are candidates for the final sample. The algorithm uses a table
of distinct indices I[j] for 1 ≤ j ≤ n, each of which points to one of the records
in the reservoir.

R1. [Initialize.] Input the first n records and copy them to the reservoir. Set
I[j] ← j for 1 ≤ j ≤ n, and set t ← m ← n. (If the file being sampled has
fewer than n records, it will of course be necessary to abort the algorithm and
report failure. During this algorithm, {I[1], ..., I[n]} point to the records
in the current sample, m is the size of the reservoir, and t is the number of
input records dealt with so far.)

R2. [End of file?] If there are no more records to be input, go to step R6.

R3. [Generate and test.] Increase t by 1, then generate a random integer M
between 1 and t (inclusive). If M > n, go to R5.

R4. [Add to reservoir.] Copy the next record of the input file to the reservoir,
increase m by 1, and set I[M] ← m. (The record previously pointed to by
I[M] is being replaced in the sample by the new record.) Go back to R2.

R5. [Skip.] Skip over the next record of the input file (do not include it in the
reservoir), and return to step R2.

R6. [Second pass.] Sort the I table entries so that I[1] < ··· < I[n]; then go
through the reservoir, copying the records with these indices into the output
file that is to hold the final sample. ▮

Algorithm R is due to Alan G. Waterman. The reader may wish to work
out the example of its operation that appears in exercise 9.
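A minimal Python sketch of Algorithm R follows. It rests on several assumptions made here for illustration: a Python list stands in for the auxiliary reservoir file, positions in it are 0-based, and the function name `reservoir_sample` and its `rng` argument are hypothetical rather than part of the algorithm as stated.

```python
import random


def reservoir_sample(records, n, rng=random):
    """Return (sample, reservoir_size), following steps R1-R6 for an input of unknown length."""
    it = iter(records)
    reservoir = []                       # stands in for the auxiliary "reservoir" file
    for _ in range(n):                   # R1: the first n records always go in
        try:
            reservoir.append(next(it))
        except StopIteration:
            raise ValueError("file has fewer than n records") from None
    index = list(range(n))               # the table I[1..n], stored as 0-based positions
    t = n                                # number of input records dealt with so far
    for record in it:                    # R2: continue until end of file
        t += 1                           # R3: random integer M with 1 <= M <= t
        M = rng.randint(1, t)
        if M <= n:                       # R4: add to reservoir and redirect I[M]
            reservoir.append(record)
            index[M - 1] = len(reservoir) - 1
        # R5: otherwise the record is skipped
    index.sort()                         # R6: second pass, in reservoir order
    return [reservoir[i] for i in index], len(reservoir)


sample, m = reservoir_sample(range(1, 1001), 10, random.Random(3))
```

The second returned value is the final reservoir size m, which is typically much smaller than the number of input records.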

If the records are sufficiently short, it is of course unnecessary to have a 
reservoir at all; we can keep the n records of the current sample in memory at 
all times (see exercise 10). 

The natural question to ask about Algorithm R is, “What is the expected 
size of the reservoir?” Exercise 11 shows that the average value of m is exactly
$n(1 + H_N - H_n)$; this is approximately n(1 + ln(N/n)). So if N/n = 1000, the
reservoir will contain only about 1/125 as many items as the original file.
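The size estimate can be checked numerically. The following sketch simply evaluates the two expressions quoted in the text for one illustrative choice of n and N; the function names and the use of floating-point harmonic sums are assumptions made here.

```python
import math


def harmonic(k):
    """Floating-point approximation to the harmonic number H_k = 1 + 1/2 + ... + 1/k."""
    return sum(1.0 / i for i in range(1, k + 1))


def expected_reservoir_size(n, N):
    """Average reservoir size n(1 + H_N - H_n) quoted in the text."""
    return n * (1.0 + harmonic(N) - harmonic(n))


n, N = 10, 10_000                        # N/n = 1000
exact = expected_reservoir_size(n, N)    # about 78.6 records
approx = n * (1.0 + math.log(N / n))     # about 79.1 records
fraction_of_file = exact / N             # roughly 1/127 of the file
```

Both values are close to 8n, in line with the statement that the reservoir holds only a small fraction of the file when N/n = 1000.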

Note that Algorithms S and R can be used to obtain samples for several 
independent categories simultaneously. For example, if we have a large file of 
names and addresses of U.S. residents, we could pick random samples of exactly 
10 people from each of the 50 states without making 50 passes through the file, 
and without first sorting the file by state. 

The sampling problem can be regarded as the computation of a random 
combination, according to the conventional definition of combinations of N 
things taken n at a time (see Section 1.2.6). Now let us consider the problem 
of computing a random permutation of t objects; we will call this the shuffling 
problem, since shuffling a deck of cards is nothing more than subjecting it to a 
random permutation. 

A moment’s reflection is enough to convince oneself that the approaches 
people traditionally use to shuffle cards are miserably inadequate; there is no hope 
of obtaining each of the t! permutations with anywhere near equal probability 
by such methods. It has been said that expert bridge players make use of this 
fact when deciding whether or not to “finesse.” 

If t is small, we can obtain a random permutation very quickly by generating
a random integer between 1 and t!. For example, when t = 4, a random number
between 1 and 24 suffices to select a random permutation from a table of all
possibilities. But for large t, it is necessary to be more careful if we want to
claim that each permutation is equally likely, since t! is much larger than the
accuracy of individual random numbers.

A suitable shuffling procedure can be obtained by recalling Algorithm 3.3.2P,
which gives a simple correspondence between each of the t! possible permutations
and a sequence of numbers $(c_1, c_2, \ldots, c_{t-1})$, with $0 \le c_j \le j$. It is easy to
compute such a set of numbers at random, and we can use the correspondence
to produce a random permutation.

Algorithm P (Shuffling). Let $X_1$, $X_2$, ..., $X_t$ be a set of t numbers to be shuffled.

P1. [Initialize.] Set j ← t.

P2. [Generate U.] Generate a random number U, uniformly distributed between
zero and one.

P3. [Exchange.] Set k ← ⌊jU⌋ + 1. (Now k is a random integer, between 1
and j.) Exchange $X_k \leftrightarrow X_j$.

P4. [Decrease j.] Decrease j by 1. If j > 1, return to step P2. ▮

This algorithm was first published by L. E. Moses and R. V. Oakford, in
Tables of Random Permutations (Stanford University Press, 1963); and by R.
Durstenfeld, CACM 7 (1964), 420. It can also be modified to obtain a random
permutation of a random combination (see exercise 14).
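Algorithm P translates almost directly into code. The Python sketch below is an illustration only; the name `shuffle_in_place` and the `rng` argument are assumptions made here, and the list is stored with 0-based indices, so X_k corresponds to `x[k - 1]`.

```python
import math
import random


def shuffle_in_place(x, rng=random):
    """Shuffle the list x in place, following steps P1-P4."""
    j = len(x)                           # P1: j <- t
    while j > 1:
        u = rng.random()                 # P2: uniform deviate, 0 <= u < 1
        k = math.floor(j * u) + 1        # P3: random integer with 1 <= k <= j
        x[k - 1], x[j - 1] = x[j - 1], x[k - 1]   # exchange X_k and X_j
        j -= 1                           # P4: decrease j
    return x


deck = shuffle_in_place(list(range(1, 53)), random.Random(7))
```

Each pass fixes the element in position j and never touches it again, so t − 1 exchanges suffice for t items.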

For a discussion of random combinatorial objects of other kinds (e.g., partitions),
see Section 7.2 and/or the book Combinatorial Algorithms by Nijenhuis
and Wilf (New York: Academic Press, 1975). 


EXERCISES 

1. [M12] Explain Eq. (1). 

2. [20] Prove that Algorithm S never tries to read more than N records of its input 
file. 

► 3. [22] The (t + 1)st item in Algorithm S is selected with probability (n − m)/(N − t),
not n/N, yet the text claims that the sample is unbiased—so each item should be 
selected with the same probability. How can both of these statements be true? 

4. [M23] Let p(m, t) be the probability that exactly m items are selected from among
the first t in the selection sampling technique. Show directly from Algorithm S that

$$p(m, t) = \binom{t}{m} \binom{N-t}{n-m} \bigg/ \binom{N}{n}, \qquad \text{for } 0 \le t \le N.$$

5. [M24] What is the average value of t when Algorithm S terminates? (In other 
words, how many of the N records have been passed, on the average, before the sample 
is complete?) 

6. [M24] What is the standard deviation of the value computed in the previous
exercise? 

► 7. [M25] Prove that any given choice of n records from the set of N is obtained by
Algorithm S with probability $1/\binom{N}{n}$. Therefore the sample is completely unbiased.

8. [M46] Algorithm S computes one uniform deviate for each input record it handles.
Find a more efficient way to determine the number of input records to skip before the 
first is selected, assuming that N/n is rather large. (We could iterate this process to 
select the remaining n — 1 records, thus reducing the number of necessary random 
deviates from order N to order n.) 

9. [12] Let n = 3. If Algorithm R is applied to a file containing 20 records numbered 
1 thru 20, and if the random numbers generated in step R3 are respectively 

4,1,6,7,5,3,5,11,11,3,7,9,3,11,4,5,4, 

which records go into the reservoir? Which are in the final sample? 

10. [15] Modify Algorithm R so that the reservoir is eliminated, assuming that the n
records of the current sample can be held in memory. 

► 11. [M25] Let $p_m$ be the probability that exactly m elements are put into the reservoir
during the first pass of Algorithm R. Determine the generating function
$G(z) = \sum_m p_m z^m$, and find the mean and standard deviation. (Use the ideas of Section 1.2.10.)

12. [M26] The gist of Algorithm P is that any permutation π can be uniquely written
as a product of transpositions in the form $\pi = (a_t\,t) \ldots (a_3\,3)(a_2\,2)$, where $1 \le a_j \le j$
for t ≥ j > 1. Prove that there is also a unique representation of the form
$\pi = (b_2\,2)(b_3\,3) \ldots (b_t\,t)$, where $1 \le b_j \le j$ for 1 < j ≤ t, and design an algorithm that
computes the b’s from the a’s in O(t) steps.

13. [M23] (S. W. Golomb.) One of the most common ways to shuffle cards is to divide
the deck into two parts as equal as possible, and to “riffle” them together. (See the
discussion of card-playing etiquette in Hoyle’s rules of card games; we read, “A shuffle
of this sort should be made about three times to mix the cards thoroughly.”) Consider
a deck of 2n − 1 cards $X_1$, $X_2$, ..., $X_{2n-1}$; a “perfect shuffle” s divides this deck
into $X_1$, $X_2$, ..., $X_n$ and $X_{n+1}$, ..., $X_{2n-1}$, then perfectly interleaves them to obtain
$X_1$, $X_{n+1}$, $X_2$, $X_{n+2}$, ..., $X_{2n-1}$, $X_n$. The “cut” operation $c^j$ changes $X_1$, $X_2$, ...,
$X_{2n-1}$ into $X_{j+1}$, ..., $X_{2n-1}$, $X_1$, ..., $X_j$. Show that by combining perfect shuffles
and cuts, at most (2n − 1)(2n − 2) different arrangements of the deck are possible, if
n > 1.

► 14. [80] (Ole-Johan Dahl.) If $X_k = k$ for 1 ≤ k ≤ t at the start of Algorithm P, and
if we terminate the algorithm when j reaches the value t − n, the sequence $X_{t-n+1}$,
..., $X_t$ is a random permutation of a random combination of n elements. Show how
to simulate the effect of this procedure using only O(n) cells of memory.

► 15. [M25] Devise a way to compute a random sample of n records from N, given N
and n, based on the idea of hashing (Section 6.4). Your method should use O(n) storage
locations and an average of O(n) units of time, and it should present the sample as a
sorted set of integers $1 \le X_1 < X_2 < \cdots < X_n \le N$.

16. [25] Discuss the problem of weighted sampling, where each subset of n elements 
is obtained with probability proportional to the product of the weights of the elements.


3.5. WHAT IS A RANDOM SEQUENCE?

A. Introductory remarks. We have seen in this chapter how to generate sequences

$$\langle U_n \rangle = U_0, U_1, U_2, \ldots \tag{1}$$

of real numbers in the range $0 \le U_n < 1$, and we have called them “random”
sequences even though they are completely deterministic in character. To justify 
this terminology, we claimed that the numbers “behave as if they are truly 
random.” Such a statement may be satisfactory for practical purposes (at the 
present time), but it sidesteps a very important philosophical and theoretical 
question: Precisely what do we mean by “random behavior”? A quantitative 
definition is needed. It is undesirable to talk about concepts that we do not 
really understand, especially since many apparently paradoxical statements can 
be made about random numbers. 

The mathematical theory of probability and statistics carefully sidesteps 
the question; it refrains from making absolute statements, and instead expresses 
everything in terms of how much probability is to be attached to statements 
involving random sequences of events. The axioms of probability theory are 
set up so that abstract probabilities can be computed readily, but nothing is 
said about what probability really signifies, or how this concept can be applied 
meaningfully to the actual world. In the book Probability, Statistics, and Truth
(New York: Macmillan, 1957), R. von Mises discusses this situation in detail, and 
presents the view that a proper definition of probability depends on obtaining a 
proper definition of a random sequence. 

Let us paraphrase here some statements made by two of the many authors 
who have commented on the subject. 

D. H. Lehmer (1951): “A random sequence is a vague notion embodying the 
idea of a sequence in which each term is unpredictable to the uninitiated 
and whose digits pass a certain number of tests, traditional with statisticians 
and depending somewhat on the uses to which the sequence is to be put.” 

J. N. Franklin (1962): “The sequence (1) is random if it has every property 
that is shared by all infinite sequences of independent samples of random 
variables from the uniform distribution.” 

Franklin’s statement essentially generalizes Lehmer’s to say that the sequence
must satisfy all statistical tests. His definition is not completely precise,
and we will see later that a reasonable interpretation of his statement leads us to 
conclude that there is no such thing as a random sequence! So let us begin with 
Lehmer’s less restrictive statement and attempt to make it precise. What we 
really want is a relatively short list of mathematical properties, each of which is 
satisfied by our intuitive notion of a random sequence; furthermore, the list is to 
be complete enough so that we are willing to agree that any sequence satisfying 
these properties is “random.” In this section, we will develop what seems to be
an adequate definition of randomness according to these criteria, although many
interesting questions remain to be answered. 

Let u and v be real numbers, 0 ≤ u < v ≤ 1. If U is a random variable
that is uniformly distributed between 0 and 1, the probability that u ≤ U < v
is equal to v − u. For example, the probability that ⅓ ≤ U < ⅔ is ⅓. How can
we translate this property of the single number U into a property of the infinite 
sequence $U_0$, $U_1$, $U_2$, ...? The obvious answer is to count how many times $U_n$
lies between u and v, and the average number of times should equal v − u. Our
intuitive idea of probability is based in this way on the frequency of occurrence. 

More precisely, let ν(n) be the number of values of j, 0 ≤ j < n, such
that $u \le U_j < v$; we want the ratio ν(n)/n to approach the value v − u as n
approaches infinity:

$$\lim_{n\to\infty} \nu(n)/n = v - u. \tag{2}$$

If this condition holds for all choices of u and v, the sequence is said to be 
equidistributed. 

Let S(n) be a statement about the integer n and the sequence $U_1$, $U_2$, ...;
for example, S(n) might be the statement considered above, “$u \le U_n < v$.” We
can generalize the idea used in the preceding paragraph to define “the probability
that S(n) is true” with respect to a particular infinite sequence: Let ν(n) be the
number of values of j, 0 ≤ j < n, such that S(j) is true.

Definition A. We say Pr(S(n)) = λ, if $\lim_{n\to\infty} \nu(n)/n = \lambda$. (Read, “The
probability that S(n) is true equals λ, if the limit as n tends to infinity of ν(n)/n
equals λ.”)

In terms of this notation, the sequence $U_0$, $U_1$, ... is equidistributed if and
only if $\Pr(u \le U_n < v) = v - u$, for all real numbers u, v with 0 ≤ u < v ≤ 1.

A sequence may be equidistributed without being random. For example, if
$U_0$, $U_1$, ... and $V_0$, $V_1$, ... are equidistributed sequences, it is not hard to show
that the sequence

$$W_0, W_1, W_2, W_3, \ldots = \tfrac12 U_0,\ \tfrac12 + \tfrac12 V_0,\ \tfrac12 U_1,\ \tfrac12 + \tfrac12 V_1,\ \ldots \tag{3}$$

is also equidistributed, since the subsequence $\tfrac12 U_0$, $\tfrac12 U_1$, ... is equidistributed
between 0 and ½, while the alternate terms $\tfrac12 + \tfrac12 V_0$, $\tfrac12 + \tfrac12 V_1$, ..., are
equidistributed between ½ and 1. In the sequence of W’s, a value less than ½ is
always followed by a value greater than or equal to ½, and conversely; hence the
sequence is not random by any reasonable definition. A stronger property than
equidistribution is needed.
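The behavior of sequence (3) can be observed on a finite prefix. The sketch below is an illustration under stated assumptions: Python's pseudo-random generator stands in for the equidistributed sequences U and V, a finite prefix only approximates the limiting frequencies, and all function names are hypothetical.

```python
import random


def interleaved_sequence(count, rng):
    """Return W_0, ..., W_{count-1} built as in (3) from pseudo-random U's and V's."""
    w = []
    while len(w) < count:
        w.append(0.5 * rng.random())           # (1/2) U_i, lies in [0, 1/2)
        w.append(0.5 + 0.5 * rng.random())     # 1/2 + (1/2) V_i, lies in [1/2, 1)
    return w[:count]


def frequency(seq, u, v):
    """The ratio nu(n)/n for the statement u <= W_j < v."""
    return sum(u <= x < v for x in seq) / len(seq)


def pair_frequency(seq, u1, v1, u2, v2):
    """Fraction of adjacent pairs with u1 <= W_j < v1 and u2 <= W_{j+1} < v2."""
    pairs = zip(seq, seq[1:])
    return sum(u1 <= a < v1 and u2 <= b < v2 for a, b in pairs) / (len(seq) - 1)


w = interleaved_sequence(100_000, random.Random(2024))
single = frequency(w, 0.25, 0.75)                    # close to v - u = 1/2
both_low = pair_frequency(w, 0.0, 0.5, 0.0, 0.5)     # two small values never adjoin
```

The single-term frequencies look unobjectionable, while the pair frequency for two consecutive values below ½ is zero instead of ¼; this is the kind of defect that a condition on adjacent pairs is meant to exclude.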

A natural generalization of the equidistribution property, which removes 
the objection stated in the preceding paragraph, is to consider adjacent pairs of 
numbers of our sequence. We can require the sequence to satisfy the condition 

$$\Pr(u_1 \le U_n < v_1 \text{ and } u_2 \le U_{n+1} < v_2) = (v_1 - u_1)(v_2 - u_2) \tag{4}$$

for any four numbers $u_1$, $v_1$, $u_2$, $v_2$ with $0 \le u_1 < v_1 \le 1$, $0 \le u_2 < v_2 \le 1$. In
general, for any positive integer k we can require our sequence to be k-distributed
in the following sense:

Definition B. The sequence (1) is said to be k-distributed if

$$\Pr(u_1 \le U_n < v_1, \ldots, u_k \le U_{n+k-1} < v_k) = (v_1 - u_1) \ldots (v_k - u_k) \tag{5}$$

for all choices of real numbers $u_j$, $v_j$, with $0 \le u_j < v_j \le 1$, for 1 ≤ j ≤ k.

An equidistributed sequence is a 1-distributed sequence. Note that if k > 1,
a k-distributed sequence is always (k − 1)-distributed, since we may set $u_k = 0$
and $v_k = 1$ in Eq. (5). Thus, in particular, any sequence that is known to be
4-distributed must also be 3-distributed, 2-distributed, and equidistributed. We
can investigate the largest k for which a given sequence is k-distributed; and this
leads us to formulate

Definition C. A sequence is said to be ∞-distributed if it is k-distributed for all
positive integers k. 

So far we have considered “[0,1) sequences,” i.e., sequences of real numbers
lying between zero and one. The same ideas apply to integer-valued sequences;
let us say a sequence $\langle X_n \rangle = X_0, X_1, X_2, \ldots$ is a “b-ary sequence” if each $X_n$ is
one of the integers 0, 1, ..., b − 1. Thus, a 2-ary (binary) sequence is a sequence
of zeros and ones.

We also say that a k-digit “b-ary number” is a string of k integers $x_1 x_2 \ldots x_k$,
where $0 \le x_j < b$ for 1 ≤ j ≤ k.

Definition D. A b-ary sequence is said to be k-distributed if

$$\Pr(X_n X_{n+1} \ldots X_{n+k-1} = x_1 x_2 \ldots x_k) = 1/b^k \tag{6}$$

for all b-ary numbers $x_1 x_2 \ldots x_k$.

It is clear from this definition that if $U_0$, $U_1$, ... is a k-distributed [0,1)
sequence, then the sequence $\lfloor bU_0 \rfloor$, $\lfloor bU_1 \rfloor$, ... is a k-distributed b-ary sequence.
(If we set $u_j = x_j/b$, $v_j = (x_j + 1)/b$, $X_n = \lfloor bU_n \rfloor$, Eq. (5) becomes Eq. (6).)
Furthermore, every k-distributed b-ary sequence is also (k − 1)-distributed, if
k > 1: we add together the probabilities for the b-ary numbers $x_1 \ldots x_{k-1}\,0$,
$x_1 \ldots x_{k-1}\,1$, ..., $x_1 \ldots x_{k-1}\,(b-1)$ to obtain

$$\Pr(X_n \ldots X_{n+k-2} = x_1 \ldots x_{k-1}) = 1/b^{k-1}.$$

(Probabilities for disjoint events are additive; see exercise 5.) It therefore is
natural to speak of an ∞-distributed b-ary sequence, as in Definition C above.
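Block frequencies of this kind are easy to tabulate for a finite prefix of a b-ary sequence. The following sketch is illustrative only: the function names are hypothetical, and a finite prefix can suggest but never establish the limiting values. The periodic binary sequence 0, 0, 1, 1, 0, 0, 1, 1, ... serves as an example in which every 2-digit block has frequency ¼ although only four of the eight 3-digit blocks ever occur.

```python
from collections import Counter
from itertools import cycle, islice


def tuple_frequencies(x, k):
    """Relative frequency of each k-digit block X_n ... X_{n+k-1} in the finite prefix x."""
    windows = [tuple(x[n:n + k]) for n in range(len(x) - k + 1)]
    counts = Counter(windows)
    return {block: c / len(windows) for block, c in counts.items()}


def to_b_ary(us, b):
    """Map values U_n in [0, 1) to the b-ary digits floor(b * U_n)."""
    return [int(b * u) for u in us]


# Prefix of the periodic binary sequence 0, 0, 1, 1, 0, 0, 1, 1, ...
x = list(islice(cycle([0, 0, 1, 1]), 4001))
pairs = tuple_frequencies(x, 2)      # 00, 01, 11, 10 each with frequency 1/4
triples = tuple_frequencies(x, 3)    # only 001, 011, 110, 100 appear
```

So this sequence behaves like a 2-distributed binary sequence on the prefix but cannot be 3-distributed, in agreement with the remark that k-distribution for larger k is the stronger requirement.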

The representation of a positive real number in the radix-b number system
may be regarded as a b-ary sequence; for example, π corresponds to the 10-ary
sequence 3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, ... . It has been conjectured that this
sequence is ∞-distributed, but nobody has yet been able to prove that it is even
1-distributed.

Let us analyze these concepts a little more closely in the case when k equals 
a million. A binary sequence that is 1000000-distributed is going to have runs of 
a million zeros in a row! Similarly, a [0,1) sequence that is 1000000-distributed 
is going to have runs of a million consecutive values each of which is less than ½.
It is true that this will happen only $(\tfrac12)^{1000000}$ of the time, on the average, but
the fact is that it does happen. Indeed, this phenomenon will occur in any truly 
random sequence, using our intuitive notion of “truly random.” One can easily 
imagine that such a situation will have a drastic effect if this set of a million 
“truly random” numbers is being used in a computer-simulation experiment; 
there would be good reason to complain about the random number generator. 
However, if we have a sequence of numbers that never has runs of a million 
consecutive U’s less than ½, the sequence is not random, and it will not be a
suitable source of numbers for other conceivable applications that use extremely
long blocks of U’s as input. In summary, a truly random sequence will exhibit
local nonrandomness. Local nonrandomness is necessary in some applications, 
but it is disastrous in others. We are forced to conclude that no sequence of 
“random” numbers can be adequate for every application. 

In a similar vein, one may argue that there is no way to judge whether a 
finite sequence is random or not; any particular sequence is just as likely as any 
other one. These facts are definitely stumbling blocks if we are ever to have a 
useful definition of randomness, but they are not really cause for alarm. It is 
still possible to give a definition for the randomness of infinite sequences of real 
numbers in such a way that the corresponding theory (viewed properly) will give 
us a great deal of insight concerning the ordinary finite sequences of rational 
numbers that are actually generated on a computer. Furthermore, we shall see 
later in this section that there are several plausible definitions of randomness for 
finite sequences. 

B. ∞-distributed sequences. Let us now undertake a brief study of the theory
of sequences that are ∞-distributed. To describe the theory adequately, we will
need to use a bit of higher mathematics, so we assume in the remainder of this 
subsection that the reader knows the material ordinarily taught in an “advanced 
calculus” course. 

First it is convenient to generalize Definition A, since the limit appearing 
there does not exist for all sequences. Let us define 

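$$\overline{\Pr}(S(n)) = \limsup_{n\to\infty}\,(\nu(n)/n), \qquad \underline{\Pr}(S(n)) = \liminf_{n\to\infty}\,(\nu(n)/n). \tag{7}$$

Then Pr(S(n)), if it exists, is the common value of $\underline{\Pr}(S(n))$ and $\overline{\Pr}(S(n))$.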

We have seen that a k-distributed [0,1) sequence leads to a k-distributed
b-ary sequence, if U is replaced by ⌊bU⌋. Our first theorem shows that a converse
result is also true.

Theorem A. Let $\langle U_n \rangle = U_0, U_1, U_2, \ldots$ be a [0,1) sequence. If the sequence

$$\langle \lfloor b_j U_n \rfloor \rangle = \lfloor b_j U_0 \rfloor, \lfloor b_j U_1 \rfloor, \lfloor b_j U_2 \rfloor, \ldots$$

is a k-distributed $b_j$-ary sequence for all $b_j$ in an infinite sequence of integers
$1 < b_1 < b_2 < b_3 < \cdots$, then the original sequence $\langle U_n \rangle$ is k-distributed.

As an example of this theorem, suppose that $b_j = 2^j$. The sequence 
$\lfloor 2^jU_0\rfloor, \lfloor 2^jU_1\rfloor, \ldots$ is essentially the sequence of the first j bits of the binary 
representations of $U_0, U_1, \ldots$. If all these integer sequences are k-distributed, 
in the sense of Definition D, then the real-valued sequence $U_0, U_1, \ldots$ must also 
be k-distributed in the sense of Definition B. 
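The remark about leading bits can be checked on a single value. The following is a minimal sketch, not part of the original text; the helper name `leading_bits` and the sample value 11/16 are chosen here purely for illustration.

```python
from fractions import Fraction
from math import floor

def leading_bits(u, j):
    """Return floor(2**j * u) for 0 <= u < 1.

    Read as a j-bit binary integer, the result spells out the first j
    binary digits of u (hypothetical helper, for illustration only).
    """
    return floor((2 ** j) * u)

u = Fraction(11, 16)  # 0.1011 in binary
for j in range(1, 5):
    # format(..., "0{j}b") pads the integer to j binary digits
    print(j, format(leading_bits(u, j), "0{}b".format(j)))
```

Exact rational arithmetic is used so that no floating-point rounding can disturb the digits.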

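Proof of Theorem A. If the sequence $\lfloor bU_0\rfloor, \lfloor bU_1\rfloor, \ldots$ is k-distributed, it follows 
by the addition of probabilities that Eq. (5) holds whenever each $u_j$ and $v_j$ is a 
rational number with denominator b. Now let $u_j$, $v_j$ be any real numbers, and 
let $u'_j$, $v'_j$ be rational numbers with denominator b such that 

$$u'_j \le u_j < u'_j + 1/b, \qquad v'_j \le v_j < v'_j + 1/b.$$

Let S(n) be the statement that $u_1 \le U_n < v_1$, ..., $u_k \le U_{n+k-1} < v_k$. We 
have 

$$\underline{\Pr}\bigl(S(n)\bigr) \ge \Pr\Bigl(u'_1 + \tfrac1b \le U_n < v'_1,\ \ldots,\ u'_k + \tfrac1b \le U_{n+k-1} < v'_k\Bigr) = \Bigl(v'_1 - u'_1 - \tfrac1b\Bigr)\cdots\Bigl(v'_k - u'_k - \tfrac1b\Bigr);$$

$$\overline{\Pr}\bigl(S(n)\bigr) \le \Pr\Bigl(u'_1 \le U_n < v'_1 + \tfrac1b,\ \ldots,\ u'_k \le U_{n+k-1} < v'_k + \tfrac1b\Bigr) = \Bigl(v'_1 - u'_1 + \tfrac1b\Bigr)\cdots\Bigl(v'_k - u'_k + \tfrac1b\Bigr).$$

Now $|(v'_j - u'_j \pm 1/b) - (v_j - u_j)| \le 2/b$; since our inequalities hold for all 
$b = b_j$, and since $b_j \to \infty$ as $j \to \infty$, we have 

$$(v_1 - u_1)\ldots(v_k - u_k) \le \underline{\Pr}\bigl(S(n)\bigr) \le \overline{\Pr}\bigl(S(n)\bigr) \le (v_1 - u_1)\ldots(v_k - u_k). \quad ∎$$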

The next theorem is our main tool for proving things about k-distributed 
sequences. 

Theorem B. Let $\langle U_n\rangle$ be a k-distributed [0,1) sequence, and let $f(x_1, x_2, \ldots, x_k)$ 
be a Riemann-integrable function of k variables; then 

$$\lim_{n\to\infty} \frac1n \sum_{0\le j<n} f(U_j, U_{j+1}, \ldots, U_{j+k-1}) = \int_0^1\!\cdots\!\int_0^1 f(x_1, x_2, \ldots, x_k)\,dx_1 \ldots dx_k. \tag{8}$$

Proof. The definition of a k-distributed sequence states that this result is true 
in the special case that 

$$f(x_1,\ldots,x_k) = \begin{cases} 1, & \text{if } u_1 \le x_1 < v_1,\ \ldots,\ u_k \le x_k < v_k; \\ 0, & \text{otherwise.} \end{cases} \tag{9}$$

Therefore Eq. (8) is true whenever $f = a_1f_1 + a_2f_2 + \cdots + a_mf_m$ and when each 
$f_j$ is a function of type (9); in other words, Eq. (8) holds whenever f is a "step 
function" obtained by (i) partitioning the unit k-dimensional cube into subcells 
whose faces are parallel to the coordinate axes, and (ii) assigning a constant 
value to f on each subcell. 

Now let f be any Riemann-integrable function. If ε is any positive number, 
we know (by the definition of Riemann-integrability) that there exist step 
functions $\underline{f}$ and $\overline{f}$ such that $\underline{f}(x_1,\ldots,x_k) \le f(x_1,\ldots,x_k) \le \overline{f}(x_1,\ldots,x_k)$, and such 
that the difference of the integrals of $\underline{f}$ and $\overline{f}$ is less than ε. Since Eq. (8) holds 
for $\underline{f}$ and $\overline{f}$, and since 

$$\frac1n \sum_{0\le j<n} \underline{f}(U_j,\ldots,U_{j+k-1}) \le \frac1n \sum_{0\le j<n} f(U_j,\ldots,U_{j+k-1}) \le \frac1n \sum_{0\le j<n} \overline{f}(U_j,\ldots,U_{j+k-1}),$$

we conclude that Eq. (8) is true also for f. ∎

Theorem B can be applied, for example, to the permutation test of Section 
3.3.2. Let $(p_1, p_2, \ldots, p_k)$ be any permutation of the numbers {1, 2, ..., k}; we 
want to show that 

$$\Pr(U_{n+p_1-1} < U_{n+p_2-1} < \cdots < U_{n+p_k-1}) = 1/k!. \tag{10}$$

To prove this, assume that the sequence $\langle U_n\rangle$ is k-distributed, and let 

$$f(x_1,\ldots,x_k) = \begin{cases} 1, & \text{if } x_{p_1} < x_{p_2} < \cdots < x_{p_k}; \\ 0, & \text{otherwise.} \end{cases}$$

We have 

$$\Pr(U_{n+p_1-1} < U_{n+p_2-1} < \cdots < U_{n+p_k-1}) = \int_0^1\!\cdots\!\int_0^1 f(x_1,\ldots,x_k)\,dx_1 \ldots dx_k$$
$$= \int_0^1 dx_{p_k} \int_0^{x_{p_k}} dx_{p_{k-1}} \cdots \int_0^{x_{p_3}} dx_{p_2} \int_0^{x_{p_2}} dx_{p_1} = \frac{1}{k!}.$$

Corollary P. If a [0,1) sequence is k-distributed, it satisfies the permutation test 
of order k, in the sense of Eq. (10). ∎
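As a worked instance of the iterated integral behind Eq. (10), the case k = 3 can be evaluated from the inside out. This is only an illustrative calculation; it writes $y_1 < y_2 < y_3$ as shorthand (an assumption of this sketch) for $x_{p_1} < x_{p_2} < x_{p_3}$.

```latex
\int_0^1 dy_3 \int_0^{y_3} dy_2 \int_0^{y_2} dy_1
  = \int_0^1 dy_3 \int_0^{y_3} y_2 \, dy_2
  = \int_0^1 \frac{y_3^2}{2} \, dy_3
  = \frac{1}{6} = \frac{1}{3!}.
```

Each further nested integration contributes one more factor to the denominator, which is where the general value 1/k! comes from.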
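We can also show that the serial correlation test is satisfied: 

Corollary S. If a [0,1) sequence is (k + 1)-distributed, the serial correlation 
coefficient between $U_n$ and $U_{n+k}$ tends to zero: 

$$\lim_{n\to\infty} \frac{\frac1n\sum U_jU_{j+k} - \bigl(\frac1n\sum U_j\bigr)\bigl(\frac1n\sum U_{j+k}\bigr)}{\sqrt{\Bigl(\frac1n\sum U_j^2 - \bigl(\frac1n\sum U_j\bigr)^2\Bigr)\Bigl(\frac1n\sum U_{j+k}^2 - \bigl(\frac1n\sum U_{j+k}\bigr)^2\Bigr)}} = 0.$$

(All summations here are for 0 ≤ j < n.) 

Proof. By Theorem B, the quantities 

$$\frac1n\sum U_jU_{j+k}, \quad \frac1n\sum U_j^2, \quad \frac1n\sum U_{j+k}^2, \quad \frac1n\sum U_j, \quad \frac1n\sum U_{j+k}$$

tend to the respective limits 1/4, 1/3, 1/3, 1/2, 1/2 as n → ∞. ∎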
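To see why these limits settle the matter, one can substitute them into the serial correlation coefficient. The following short calculation is an illustrative sketch; it uses the limiting values 1/4 for the mean of products, 1/3 for the mean squares, and 1/2 for the means, which are the integrals of $xy$, $x^2$, and $x$ over the unit square or unit interval.

```latex
\text{numerator} \;\to\; \frac14 - \frac12\cdot\frac12 = 0,
\qquad
\text{denominator} \;\to\;
  \sqrt{\Bigl(\frac13 - \frac14\Bigr)\Bigl(\frac13 - \frac14\Bigr)} = \frac{1}{12} \ne 0 .
```

Since the denominator has a nonzero limit, the quotient tends to 0/(1/12) = 0.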

Let us now consider some slightly more general distribution properties of 
sequences. We have defined the notion of k-distribution by considering all of 
the adjacent k-tuples; for example, a sequence is 2-distributed if and only if the 
points 

$(U_0,U_1),\ (U_1,U_2),\ (U_2,U_3),\ (U_3,U_4),\ (U_4,U_5),\ \ldots$ 

are equidistributed in the unit square. It is quite possible, however, that this can 
happen while alternate pairs of points $(U_1,U_2)$, $(U_3,U_4)$, $(U_5,U_6)$, ... are not 
equidistributed; if the density of points $(U_{2n-1},U_{2n})$ is deficient in some area, 
the other points $(U_{2n},U_{2n+1})$ might compensate. For example, the periodic 
binary sequence 

$\langle X_n\rangle$ = 0,0,0,1, 0,0,0,1, 1,1,0,1, 1,1,0,1, 0,0,0,1, ..., (11) 

with a period of length 16, is seen to be 3-distributed; yet the sequence of 
even-numbered elements $\langle X_{2n}\rangle$ = 0, 0, 0, 0, 1, 0, 1, 0, ... has three times as many 
zeros as ones, while the subsequence of odd-numbered elements $\langle X_{2n+1}\rangle$ = 0, 1, 
0, 1, 1, 1, 1, 1, ... has three times as many ones as zeros. 
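The claims about the periodic sequence (11) are easy to confirm mechanically. The sketch below counts all cyclic triples over one period and then tallies the even- and odd-numbered elements; the names `PERIOD` and `tuple_counts` are illustrative and do not come from the text.

```python
from collections import Counter

# One period (length 16) of the binary sequence (11).
PERIOD = [0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1]

def tuple_counts(period, k):
    """Count each k-tuple of adjacent elements over one full cyclic period."""
    n = len(period)
    return Counter(
        tuple(period[(i + t) % n] for t in range(k)) for i in range(n)
    )

counts3 = tuple_counts(PERIOD, 3)
even = PERIOD[0::2]   # X_0, X_2, X_4, ...
odd = PERIOD[1::2]    # X_1, X_3, X_5, ...

print(sorted(counts3.items()))
print("even-numbered:", even.count(0), "zeros,", even.count(1), "ones")
print("odd-numbered: ", odd.count(0), "zeros,", odd.count(1), "ones")
```

Because the sequence is periodic, equal counts for all eight triples over one period are enough to give each triple limiting frequency 1/8.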

If a sequence $\langle U_n\rangle$ is ∞-distributed, example (11) shows that it is not at all 
obvious that the subsequence of alternate terms $\langle U_{2n}\rangle = U_0, U_2, U_4, U_6, \ldots$ will 
be ∞-distributed or even 1-distributed. But we shall see that $\langle U_{2n}\rangle$ is, in fact, 
∞-distributed, and much more is true. 

Definition E. A [0,1) sequence $\langle U_n\rangle$ is said to be (m, k)-distributed if 

$$\Pr(u_1 \le U_{mn+j} < v_1,\ u_2 \le U_{mn+j+1} < v_2,\ \ldots,\ u_k \le U_{mn+j+k-1} < v_k) = (v_1 - u_1)\ldots(v_k - u_k)$$

for all choices of real numbers $u_r$, $v_r$ with $0 \le u_r < v_r \le 1$ for $1 \le r \le k$, and 
for all integers j with $0 \le j < m$. 

Thus a k-distributed sequence is the special case m = 1 in Definition E; the case 
m = 2 means that the k-tuples starting in even positions must have the same 
density as the k-tuples starting in odd positions, etc. 

Several properties of Definition E are obvious: 

An (m, k)-distributed sequence is (m, κ)-distributed for 1 ≤ κ ≤ k. (12) 

An (m, k)-distributed sequence is (d, k)-distributed for all divisors d of m. (13) 

We can also define the concept of an (m, k)-distributed b-ary sequence, as in 
Definition D; and the proof of Theorem A remains valid for (m, k)-distributed 
sequences. 

The next theorem, which is in many ways rather surprising, shows that the 
property of being ∞-distributed is very strong indeed, much stronger than we 
imagined it to be when we first considered the definition of the concept. 

Theorem C (Ivan Niven and H. S. Zuckerman). An ∞-distributed sequence is 
(m, k)-distributed for all positive integers m and k. 

Proof. It suffices to prove the theorem for b-ary sequences, by using the 
generalization of Theorem A just mentioned. Furthermore, we may assume that 
m = k, because (12) and (13) tell us that the sequence will be (m, k)-distributed 
if it is (mk, mk)-distributed. 

So we will prove that any ∞-distributed b-ary sequence $X_0, X_1, \ldots$ is 
(m, m)-distributed for all positive integers m. Our proof is a simplified version of the 
original one given by Niven and Zuckerman in Pacific J. Math. 1 (1951), 103-109. 

The key idea we shall use is an important technique that applies to many 
situations in mathematics: “If the sum of m quantities and the sum of their 
squares are both consistent with the hypothesis that the m quantities are equal, 
then that hypothesis is true.” In a strong form, this principle may be stated as 
follows: 

Lemma E. Given m sequences of numbers $\langle y_{jn}\rangle = y_{j0}, y_{j1}, \ldots$ for $1 \le j \le m$, 
suppose that 

$$\lim_{n\to\infty}\,(y_{1n} + y_{2n} + \cdots + y_{mn}) = m\alpha, \qquad \limsup_{n\to\infty}\,(y_{1n}^2 + y_{2n}^2 + \cdots + y_{mn}^2) \le m\alpha^2. \tag{14}$$

Then for each j, $\lim_{n\to\infty} y_{jn}$ exists and equals α. 

An incredibly simple proof of this lemma is given in exercise 9. ∎
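One way to see why Lemma E holds is to expand a sum of squared deviations. The derivation below is a sketch of one possible argument under the lemma's two hypotheses; it is offered as an illustration and is not necessarily the argument intended in exercise 9.

```latex
\sum_{1\le j\le m} (y_{jn} - \alpha)^2
  = \sum_{1\le j\le m} y_{jn}^2
    \;-\; 2\alpha \sum_{1\le j\le m} y_{jn}
    \;+\; m\alpha^2 ,
\qquad\text{so}\qquad
\limsup_{n\to\infty} \sum_{1\le j\le m} (y_{jn} - \alpha)^2
  \;\le\; m\alpha^2 - 2\alpha\,(m\alpha) + m\alpha^2 \;=\; 0 .
```

Every term $(y_{jn} - \alpha)^2$ is nonnegative and bounded by the whole sum, so each one tends to zero, which means $y_{jn} \to \alpha$ for each j.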

Resuming our proof of Theorem C, let $x = x_1x_2\ldots x_m$ be a b-ary number, 
and say that x occurs at position p if $X_{p-m+1}X_{p-m+2}\ldots X_p = x$. Let $\nu_j(n)$ 
be the number of occurrences of x at position p when p < n and p mod m = j. 
Let $y_{jn} = \nu_j(n)/n$; we wish to prove that 

$$\lim_{n\to\infty} y_{jn} = 1/mb^m. \tag{15}$$






First we know that 

$$\lim_{n\to\infty}\,(y_{0n} + y_{1n} + \cdots + y_{(m-1)n}) = 1/b^m, \tag{16}$$

since the sequence is m-distributed. By Lemma E and Eq. (16), the theorem will 
be proved if we can show that 

$$\limsup_{n\to\infty}\,(y_{0n}^2 + y_{1n}^2 + \cdots + y_{(m-1)n}^2) \le 1/mb^{2m}. \tag{17}$$

This inequality is not obvious yet; some rather delicate maneuvering is 
necessary before we can prove it. Let q be a multiple of m, and consider 

$$C(n) = \sum_{0\le j<m} \binom{\nu_j(n) - \nu_j(n-q)}{2}. \tag{18}$$

This is the number of pairs of occurrences of x in positions $p_1$, $p_2$ with 
$n - q \le p_1 < p_2 < n$ and with $p_2 - p_1$ a multiple of m. Consider now the sum 

$$S_N = \sum_{1\le n\le N+q} C(n). \tag{19}$$

Each pair of occurrences of x in positions $p_1$, $p_2$ with $p_1 < p_2 < p_1 + q$, where 
$p_2 - p_1$ is a multiple of m and $p_1 \le N$, is counted $p_1 + q - p_2$ times in the 
total $S_N$ (namely, when $p_2 < n \le p_1 + q$); and the pairs of such occurrences 
with $N < p_1 < p_2 < N + q$ are counted $N + q - p_2$ times. 

Let $d_t(n)$ be the number of pairs of occurrences of x in positions $p_1$, $p_2$ with 
$p_1 + t = p_2 < n$. The analysis above shows that 

$$\sum_{0<t<q/m} (q - mt)\,d_{mt}(N+q) \;\ge\; S_N \;\ge\; \sum_{0<t<q/m} (q - mt)\,d_{mt}(N). \tag{20}$$

Since the original sequence is q-distributed, 

$$\lim_{N\to\infty} \frac1N\, d_{mt}(N) = 1/b^{2m} \tag{21}$$

for all t, 0 < t < q/m, and therefore by (20) we have 

$$\lim_{N\to\infty} \frac{S_N}{N} = \sum_{0<t<q/m} (q - mt)/b^{2m} = q(q - m)/2mb^{2m}. \tag{22}$$

This fact will prove the theorem, after some manipulation. 

By definition, 

$$2S_N = \sum_{1\le n\le N+q}\ \sum_{0\le j<m} \Bigl(\bigl(\nu_j(n) - \nu_j(n-q)\bigr)^2 - \bigl(\nu_j(n) - \nu_j(n-q)\bigr)\Bigr),$$

and we can remove the unsquared terms by applying (16) to get 

$$\lim_{N\to\infty} \frac{T_N}{N} = q(q - m)/mb^{2m} + q/b^m, \tag{23}$$

where 

$$T_N = \sum_{1\le n\le N+q}\ \sum_{0\le j<m} \bigl(\nu_j(n) - \nu_j(n-q)\bigr)^2.$$

Using the inequality 

$$\frac1r \Bigl(\sum_{1\le j\le r} a_j\Bigr)^2 \le \sum_{1\le j\le r} a_j^2$$

(cf. exercise 1.2.3-30), we find that 

$$\limsup_{N\to\infty} \sum_{0\le j<m} \frac{1}{N(N+q)} \Bigl(\sum_{1\le n\le N+q} \bigl(\nu_j(n) - \nu_j(n-q)\bigr)\Bigr)^2 \le q(q - m)/mb^{2m} + q/b^m. \tag{24}$$

We also have 

$$q\,\nu_j(N) \;\le \sum_{N<n\le N+q} \nu_j(n) \;= \sum_{1\le n\le N+q} \bigl(\nu_j(n) - \nu_j(n-q)\bigr) \;\le\; q\,\nu_j(N+q),$$

and putting this into (24) gives 

$$\limsup_{N\to\infty} \sum_{0\le j<m} \bigl(\nu_j(N)/N\bigr)^2 \le (q - m)/qmb^{2m} + 1/qb^m. \tag{25}$$

This formula has been established whenever q is a multiple of m; and if we let 
q → ∞ we obtain (17), completing the proof. 

For a possibly simpler proof, see J. W. S. Cassels, Pacific J. Math. 2 (1952), 
555-557. ∎

Exercises 29 and 30 illustrate the nontriviality of this theorem, and they 
also demonstrate the fact that a q-distributed sequence will have probabilities 
deviating from the true (m, m)-distribution probabilities by essentially $1/\sqrt{q}$ at 
most. (Cf. (25).) The full hypothesis of ∞-distribution is necessary for the proof 
of the theorem. 

As a result of Theorem C, we can prove that an ∞-distributed sequence 
passes the serial test, the maximum-of-t test, the collision test, and the tests on 
subsequences mentioned in Section 3.3.2; it is not hard to show that the gap test, 
the poker test, and the run test are also satisfied (see exercises 12 through 14). 
The coupon collector's test is considerably more difficult to deal with, but it too 
is satisfied (see exercises 15 and 16). 

The existence of ∞-distributed sequences of a rather simple type is 
guaranteed by the next theorem. 

Theorem F (J. Franklin). The [0,1) sequence $U_0, U_1, \ldots$, with 

$$U_n = \theta^n \bmod 1 \tag{26}$$

is ∞-distributed for almost all real numbers θ > 1. That is, the set 

{ θ | θ > 1 and (26) is not ∞-distributed } 

is of measure zero. 

The proofs of this theorem and some generalizations are given in Franklin's paper 
cited below. ∎

Franklin has shown that θ must be a transcendental number for (26) to 
be ∞-distributed. The powers $\langle \pi^n \bmod 1\rangle$ have been laboriously computed for 
n ≤ 10000, using multiple-precision arithmetic, and the most significant 35 bits 
of each of these numbers, stored on a disk file, have successfully been used as 
a source of uniform deviates. According to Theorem F, the probability that 
the powers $\langle \pi^n \bmod 1\rangle$ are ∞-distributed is equal to 1; yet because there are 
uncountably many real numbers, this gives us no information as to whether the 
sequence is really ∞-distributed or not. It is a fairly safe bet that nobody in 
our lifetimes will ever prove that this particular sequence is not ∞-distributed; 
but it might not be. Because of these considerations, one may legitimately 
wonder if there is any explicit sequence that is ∞-distributed; i.e., is there an 
algorithm to compute real numbers $U_n$ for all n ≥ 0, such that the sequence 
$\langle U_n\rangle$ is ∞-distributed? The answer is yes, as shown for example by D. E. Knuth 
in BIT 5 (1965), 246-250. The sequence constructed there consists entirely of 
rational numbers; in fact, each number $U_n$ has a terminating representation in 
the binary number system. Another construction of an explicit ∞-distributed 
sequence, somewhat more complicated than the sequence just cited, follows from 
Theorem W below. See also N. M. Korobov, Izv. Akad. Nauk SSSR 20 (1956), 
649-660. 

C. Does ∞-distributed = random? In view of all the above theory about 
∞-distributed sequences, we can be sure of one thing: the concept of an 
∞-distributed sequence is an important one in mathematics. There is also a good 
deal of evidence that the following statement is a valid formulation of the 
intuitive idea of randomness: 

Definition R1. A [0,1) sequence is defined to be "random" if it is an 
∞-distributed sequence. 

We have seen that sequences meeting this definition will satisfy all the statistical 
tests of Section 3.3.2 and many more. 

Let us attempt to criticize this definition objectively. First of all, is every 
"truly random" sequence ∞-distributed? There are uncountably many sequences 
$U_0, U_1, \ldots$ of real numbers between zero and one. If a truly random number 
generator is sampled to give values $U_0, U_1, \ldots$, any of the possible sequences 
may be considered equally likely, and some of the sequences (indeed, 
uncountably many of them) are not even equidistributed. On the other hand, using any 
reasonable definition of probability on this space of all possible sequences leads 
us to conclude that a random sequence is ∞-distributed with probability one. 
We are therefore led to formalize Franklin's definition of randomness (as given 
at the beginning of this section) in the following way: 

Definition R2. A [0,1) sequence $\langle U_n\rangle$ is defined to be "random" if, whenever P 
is a property such that $P(\langle V_n\rangle)$ holds with probability one for a sequence $\langle V_n\rangle$ 
of independent samples of random variables from the uniform distribution, then 
$P(\langle U_n\rangle)$ is true. 

Is it perhaps possible that Definition R1 is equivalent to Definition R2? Let 
us try out some possible objections to Definition R1, and see whether these 
criticisms are valid. 

In the first place, Definition R1 deals only with limiting properties of the 
sequence as n → ∞. There are ∞-distributed sequences in which the first million 
elements are all zero; should such a sequence be considered random? 

This objection is not very valid. If ε is any positive number, there is no 
reason why the first million elements of a sequence should not all be less than ε. 
With probability one, a truly random sequence contains infinitely many runs 
of a million consecutive elements less than ε, so why can't this happen at the 
beginning of the sequence? 

On the other hand, consider Definition R2 and let P be the property that 
all elements of the sequence are distinct; P is true with probability one, so any 
sequence with a million zeros is not random by this criterion. 

Now let P be the property that no element of the sequence is equal to zero; 
again, P is true with probability one, so by Definition R2 any sequence with a 
zero element is nonrandom. More generally, however, let $x_0$ be any fixed number 
between zero and one, and let P be the property that no element of the sequence 
is equal to $x_0$; Definition R2 now says that no random sequence may contain 
the element $x_0$! We can now prove that no sequence satisfies the condition of 
Definition R2. (For if $U_0, U_1, \ldots$ is such a sequence, take $x_0 = U_0$.) 

Therefore if R1 is too weak a definition, R2 is certainly too strong. The 
"right" definition must be less strict than R2. We have not really shown that R1 
is too weak, however, so let us continue to attack it some more. As mentioned 
above, an ∞-distributed sequence of rational numbers has been constructed. 
(Indeed, this is not so surprising; see exercise 18.) Almost all real numbers are 
irrational; perhaps we should insist that 

$$\Pr(U_n \text{ is rational}) = 0$$

for a random sequence. 

Note that the definition of equidistribution says that $\Pr(u \le U_n < v) = 
v - u$. There is an obvious way to generalize this definition, using measure 
theory: "If S ⊆ [0,1) is a set of measure μ, then 

$$\Pr(U_n \in S) = \mu, \tag{27}$$

for all random sequences $\langle U_n\rangle$." In particular, if S is the set of rationals, 
it has measure zero, so no sequence of rational numbers is equidistributed in 
this generalized sense. It is reasonable to expect that Theorem B could be 
extended to Lebesgue integration instead of Riemann integration, if property 
(27) is stipulated. However, once again we find that definition (27) is too strict, 
for no sequence satisfies that property. If $U_0, U_1, \ldots$ is any sequence, the set 
$S = \{U_0, U_1, \ldots\}$ is of measure zero, yet $\Pr(U_n \in S) = 1$. Thus, by the force 
of the same argument we used to exclude rationals from random sequences, we 
can exclude all random sequences. 

So far Definition R1 has proved to be defensible. There are, however, some 
quite valid objections to it. For example, if we have a random sequence in the 
intuitive sense, the infinite subsequence 

$$U_0,\ U_1,\ U_4,\ U_9,\ \ldots,\ U_{n^2},\ \ldots \tag{28}$$

should also be a random sequence. This is not always true for an ∞-distributed 
sequence. In fact, if we take any ∞-distributed sequence and set $U_{n^2} \leftarrow 0$ for 
all n, the counts $\nu(n)$ that appear in the test of k-distributivity are changed by 
at most $\sqrt{n}$, so the limits of the ratios $\nu(n)/n$ remain unchanged. Definition R1 
unfortunately fails to satisfy this randomness criterion. 

Perhaps we should strengthen R1 as follows: 

Definition R3. A [0,1) sequence is said to be "random" if each of its infinite 
subsequences is ∞-distributed. 

Once again, however, the definition turns out to be too strict; any equidistributed 
sequence $\langle U_n\rangle$ has a monotonic subsequence with $U_{s_0} < U_{s_1} < U_{s_2} < \cdots$. 

The secret is to restrict the subsequences so that they could be defined by 
somebody who does not look at U n before deciding whether or not it is to be in 
the subsequence. The following definition now suggests itself: 

Definition R4. A [0,1) sequence $\langle U_n\rangle$ is said to be "random" if, for every effective 
algorithm that specifies an infinite sequence of distinct nonnegative integers $s_n$ 
for n ≥ 0, the subsequence $U_{s_0}, U_{s_1}, U_{s_2}, \ldots$ corresponding to this algorithm is 
∞-distributed. 

The algorithms referred to in Definition R4 are effective procedures that 
compute $s_n$, given n. (See Section 1.1.) Thus, for example, the sequence 
$\langle \pi^n \bmod 1\rangle$ will not satisfy R4, since it is either not equidistributed or there is an 
effective algorithm that determines an infinite subsequence $s_n$ with $(\pi^{s_0} \bmod 1) < 
(\pi^{s_1} \bmod 1) < (\pi^{s_2} \bmod 1) < \cdots$. Similarly, no explicitly defined sequence can 
satisfy Definition R4; this is appropriate, if we agree that no explicitly defined 
sequence can really be random. It is quite likely, however, that the sequence 
$\langle \theta^n \bmod 1\rangle$ will satisfy Definition R4, for almost all real numbers θ > 1; this is no 
contradiction, since almost all θ are uncomputable by algorithms. The following 
facts are known, for example: (i) The sequence $\langle \theta^n \bmod 1\rangle$ satisfies Definition R4 
for almost all real θ > 1, if "∞-distributed" is replaced by "1-distributed." This 
theorem was proved by J. F. Koksma, Compositio Mathematica 2 (1935), 
250-258. (ii) The particular sequence $\langle \theta^{s(n)} \bmod 1\rangle$ is ∞-distributed for almost all 
real θ > 1, if $\langle s(n)\rangle$ is a sequence of integers for which $s(n+1) - s(n) \to \infty$ as 
n → ∞. For example, we could have $s(n) = n^2$, or $s(n) = \lfloor n \lg n\rfloor$. 

Definition R4 is much stronger than Definition R1; but it is still reasonable 
to claim that Definition R4 is too weak. For example, let $\langle U_n\rangle$ be a truly random 
sequence, and define the subsequence $\langle U_{s_n}\rangle$ by the following rules: $s_0 = 0$, and 
(for n > 0) $s_n$ is the smallest integer ≥ n for which $U_{s_n-1}, U_{s_n-2}, \ldots, U_{s_n-n}$ 
are all less than 1/2. Thus we are considering the subsequence of values following 
the first consecutive run of n values less than 1/2. Suppose that "$U_n < \frac12$" 
corresponds to the value "heads" in the flipping of a coin. Gamblers tend to feel 
that a long run of "heads" makes the opposite condition, "tails," more probable, 
assuming that a true coin is being used; and the subsequence $\langle U_{s_n}\rangle$ just defined 
corresponds to a gambling system for a man who places his nth bet on the coin 
toss following the first run of n consecutive "heads." The gambler may think 
that $\Pr(U_{s_n} \ge \frac12)$ is more than 1/2, but of course in a truly random sequence 
$\langle U_{s_n}\rangle$ will be completely random. No gambling system will ever be able to beat 
the odds! Definition R4 says nothing about subsequences formed according to 
such a gambling system, so apparently we need something more. 

Let us define a "subsequence rule" Z as an infinite sequence of functions 
$\langle f_n(x_1, \ldots, x_n)\rangle$ where, for n ≥ 0, $f_n$ is a function of n variables, and the 
value of $f_n(x_1, \ldots, x_n)$ is either 0 or 1. Here $x_1, \ldots, x_n$ are elements of some 
set S. (Thus, in particular, $f_0$ is a constant function, either 0 or 1.) A 
subsequence rule Z defines a subsequence of any infinite sequence $\langle X_n\rangle$ of elements 
of S as follows: The nth term $X_n$ is in the subsequence $\langle X_n\rangle Z$ if and only if 
$f_n(X_0, X_1, \ldots, X_{n-1}) = 1$. Note that the subsequence $\langle X_n\rangle Z$ thus defined is 
not necessarily infinite, and it may in fact contain no elements at all. 

For example, the gambler's subsequence just described corresponds to the 
following subsequence rule: "$f_0 = 1$; and for n > 0, $f_n(x_1, \ldots, x_n) = 1$ if and 
only if there is some k in the range 0 < k ≤ n such that the k consecutive 
parameters $x_m, x_{m-1}, \ldots, x_{m-k+1}$ are all < 1/2 when m = n but not when 
k ≤ m < n." 
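The notion of a subsequence rule, and the gambler's rule in particular, can be made concrete with a small program. The sketch below is illustrative only: `apply_rule`, `gambler_rule`, and the sample list `U` are names and data invented here, and the rule receives the whole prefix $X_0, \ldots, X_{n-1}$ as a Python list instead of n separate arguments.

```python
def apply_rule(seq, f):
    """Return the terms X_n of seq for which f(X_0, ..., X_{n-1}) == 1."""
    return [x for n, x in enumerate(seq) if f(seq[:n]) == 1]

def gambler_rule(prefix):
    """f_n for the gambler's system; 'heads' means a value < 1/2.

    Selects the term that follows the FIRST run of k consecutive heads,
    for each k = 1, 2, 3, ...
    """
    n = len(prefix)
    if n == 0:
        return 1  # f_0 = 1: the very first term is always selected

    def run_ends_at(m, k):
        # Are x_m, x_{m-1}, ..., x_{m-k+1} all heads?  (1-based x's,
        # so they occupy prefix[m-k : m] in 0-based list indexing.)
        return all(x < 0.5 for x in prefix[m - k:m])

    for k in range(1, n + 1):
        first_time = not any(run_ends_at(m, k) for m in range(k, n))
        if run_ends_at(n, k) and first_time:
            return 1
    return 0

# Heads/tails pattern of this sample: H T H H T H H H T
U = [0.1, 0.7, 0.2, 0.3, 0.9, 0.4, 0.1, 0.2, 0.8]
print(apply_rule(U, gambler_rule))
```

On this sample the rule picks the terms at indices 0, 1, 4, and 8, which are exactly the positions following the first runs of 0, 1, 2, and 3 heads. The rule never inspects the term it is deciding about, only the terms before it.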

A subsequence rule Z is said to be computable if there is an effective 
algorithm that determines the value of $f_n(x_1, \ldots, x_n)$, when n and $x_1, \ldots, x_n$ are 
given as input. We had better restrict ourselves to computable subsequence rules 
when trying to define randomness, lest we obtain an overly restrictive definition 
like R3 above. But effective algorithms cannot deal nicely with arbitrary real 
numbers as inputs; for example, if a real number x is specified by an infinite 
radix-10 expansion, there is no algorithm to determine if x is < 1/3 or not, since 
all digits of the number 0.333... have to be examined. Therefore computable 
subsequence rules do not apply to all [0,1) sequences, and it is convenient to 
base our next definition on b-ary sequences. 

Definition R5. A b-ary sequence is said to be "random" if every infinite 
subsequence defined by a computable subsequence rule is 1-distributed. 

A [0,1) sequence $\langle U_n\rangle$ is said to be "random" if the b-ary sequence $\langle\lfloor bU_n\rfloor\rangle$ 
is "random" for all integers b ≥ 2. 

Note that Definition R5 says only "1-distributed," not "∞-distributed." It 
is interesting to verify that this may be done without loss of generality. For 
we may define an obviously computable subsequence rule $Z(a_1 \ldots a_k)$ as follows, 
given any b-ary number $a_1 \ldots a_k$: Let $f_n(x_1, \ldots, x_n) = 1$ if and only if n ≥ k 
and $x_{n-k+1} = a_1$, ..., $x_{n-1} = a_{k-1}$, $x_n = a_k$. Now if $\langle X_n\rangle$ is a k-distributed 
b-ary sequence, this rule $Z(a_1 \ldots a_k)$, which selects the subsequence consisting 
of those terms just following an occurrence of $a_1 \ldots a_k$, defines an infinite 
subsequence; and if this subsequence is 1-distributed, each of the (k + 1)-tuples 
$a_1 \ldots a_k a_{k+1}$ for $0 \le a_{k+1} < b$ occurs with probability $1/b^{k+1}$ in $\langle X_n\rangle$. Thus 
we can prove that a sequence satisfying Definition R5 is k-distributed for all k, 
by induction on k. Similarly, by considering the "composition" of subsequence 
rules (if $Z_1$ defines an infinite subsequence $\langle X_n\rangle Z_1$, then we can define $Z_1Z_2$ to 
be the subsequence rule for which $\langle X_n\rangle Z_1Z_2 = (\langle X_n\rangle Z_1)Z_2$) we find that all 
subsequences considered in Definition R5 are ∞-distributed. (See exercise 32.) 
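The pattern-following rule used in this argument can be tried on a small binary sequence. The sketch below is only an illustration: `apply_rule` and `pattern_rule` are hypothetical helper names, the rule is passed the prefix as a list, and the input is two periods of the 3-distributed sequence (11) discussed earlier in this section.

```python
def apply_rule(seq, f):
    """Return the terms X_n of seq for which f(X_0, ..., X_{n-1}) == 1."""
    return [x for n, x in enumerate(seq) if f(seq[:n]) == 1]

def pattern_rule(pattern):
    """Build the rule that selects each term just following an occurrence
    of the given pattern a_1 ... a_k (k >= 1 is assumed)."""
    pattern = list(pattern)
    k = len(pattern)

    def f(prefix):
        # f_n = 1 iff n >= k and the last k terms seen equal the pattern.
        n = len(prefix)
        return 1 if n >= k and list(prefix[n - k:]) == pattern else 0

    return f

# Two periods of sequence (11).
X = [0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1] * 2
selected = apply_rule(X, pattern_rule((0, 1)))
print(selected)
```

Within each full period the terms following "0 1" are split evenly between 0 and 1, which is what 3-distribution of (11) predicts for a pattern of length 2.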

The fact that ∞-distribution comes out of Definition R5 as a very special 
case is encouraging, and it is a good indication that we may at last have found the 
definition of randomness we have been seeking. But alas, there still is a problem. 
It is not clear that sequences satisfying Definition R5 must satisfy Definition R4. 
The "computable subsequence rules" we have just specified always enumerate 
subsequences $\langle X_{s_n}\rangle$ for which $s_0 < s_1 < \cdots$, but $\langle s_n\rangle$ does not have to be 
monotone in R4; it must only satisfy the condition $s_n \ne s_m$ for n ≠ m. 

To meet this objection, we may combine Definitions R4 and R5 as follows: 

Definition R6. A b-ary sequence $\langle X_n\rangle$ is said to be "random" if, for every effective 
algorithm that specifies an infinite sequence of distinct nonnegative integers $\langle s_n\rangle$ 
as a function of n and the values of $X_{s_0}, \ldots, X_{s_{n-1}}$, the subsequence $\langle X_{s_n}\rangle$ 
corresponding to this algorithm is "random" in the sense of Definition R5. 

A [0,1) sequence $\langle U_n\rangle$ is said to be "random" if the b-ary sequence $\langle\lfloor bU_n\rfloor\rangle$ 
is "random" for all integers b ≥ 2. 

The author contends* that this definition surely meets all reasonable 
philosophical requirements for randomness, so it provides an answer to the principal 
question posed in this section. 

D. Existence of random sequences. We have seen that Definition R3 is too strong, 
in the sense that no sequence can satisfy that definition; and the formulation of 
Definitions R4, R5, and R6 above was carried out in an attempt to recapture the 
essential characteristics of Definition R3. In order to show that Definition R6 is 
not overly restrictive, it is still necessary for us to prove that sequences satisfying 
all these conditions exist. Intuitively, we feel quite sure that there is no problem, 
because we believe that a truly random sequence exists and satisfies R6; but a 
proof is really necessary to show that the definition is consistent. 

*At least, he made such a contention when originally preparing the material for this section in 
1966. 

An interesting method for constructing sequences satisfying Definition R5 
has been found by A. Wald, starting with a very simple 1-distributed sequence. 

Lemma T. Let the sequence of real numbers $\langle V_n\rangle$ be defined in terms of the 
binary system as follows: 

$$V_0 = 0, \quad V_1 = .1, \quad V_2 = .01, \quad V_3 = .11, \quad V_4 = .001, \quad \ldots$$
$$V_n = .c_r \ldots c_1 1 \quad \text{if } n = 2^r + c_1 2^{r-1} + \cdots + c_r. \tag{29}$$

Let $I_{b_1 \ldots b_r}$ denote the set of all real numbers in [0,1) whose binary representation 
begins with $0.b_1 \ldots b_r$; thus 

$$I_{b_1 \ldots b_r} = \bigl[(0.b_1 \ldots b_r)_2,\ (0.b_1 \ldots b_r)_2 + 2^{-r}\bigr). \tag{30}$$

Then if $\nu(n)$ denotes the number of $V_k$ in $I_{b_1 \ldots b_r}$ for 0 ≤ k < n, we have 

$$|\nu(n)/n - 2^{-r}| \le 1/n. \tag{31}$$

Proof. Since $\nu(n)$ is the number of k for which $k \bmod 2^r = (b_r \ldots b_1)_2$, we have 
$\nu(n) = t$ or $t + 1$ when $\lfloor n/2^r\rfloor = t$. Hence $|\nu(n) - n/2^r| \le 1$. ∎
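The sequence of Lemma T reverses the binary digits of n about the radix point, so both (29) and the bound (31) can be checked directly. The following is a minimal sketch with exact rational arithmetic; the function names `V` and `nu` mirror the symbols of the lemma but are otherwise choices made here for illustration.

```python
from fractions import Fraction

def V(n):
    """V_n of Lemma T: the binary digits of n, reversed about the radix point."""
    value, scale = Fraction(0), Fraction(1, 2)
    while n > 0:
        if n & 1:          # lowest bit of n becomes the next fraction bit
            value += scale
        n >>= 1
        scale /= 2
    return value

def nu(n, bits):
    """Number of V_k with 0 <= k < n lying in I_{b_1...b_r} of Eq. (30)."""
    r = len(bits)
    lo = sum(Fraction(b, 2 ** (i + 1)) for i, b in enumerate(bits))
    hi = lo + Fraction(1, 2 ** r)
    return sum(1 for k in range(n) if lo <= V(k) < hi)

print([str(V(n)) for n in range(8)])

# Check the bound (31) for the interval I_{10} = [1/2, 3/4), so r = 2.
bits = (1, 0)
ok = all(
    abs(Fraction(nu(n, bits), n) - Fraction(1, 4)) <= Fraction(1, n)
    for n in range(1, 65)
)
print("bound (31) holds for n = 1..64:", ok)
```

The check is brute force and quadratic in n, which is acceptable for a demonstration of this size.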

It follows from (31) that the sequence $\langle\lfloor 2^rV_n\rfloor\rangle$ is an equidistributed $2^r$-ary 
sequence; hence by Theorem A, $\langle V_n\rangle$ is an equidistributed [0,1) sequence. Indeed, 
it is pretty clear that $\langle V_n\rangle$ is about as equidistributed as a [0,1) sequence can be. 
(For further discussion of this and related sequences, see J. G. van der Corput, 
Proc. Koninklijke Nederl. Akad. Wetenschappen 38 (1935), 813-821, 1058-1066; 
J. H. Halton, Numerische Math. 2 (1960), 84-90, 196; L. H. Ramshaw, J. Number 
Theory, to appear.) 

Now let $Z_1, Z_2, \ldots$ be infinitely many subsequence rules; we seek a sequence 
$\langle U_n\rangle$ for which all the infinite subsequences $\langle U_n\rangle Z_j$ are equidistributed. 

Algorithm W (Wald sequence). Given an infinite sequence of subsequence rules 
$Z_1, Z_2, \ldots$ that define subsequences of [0,1) sequences of rational numbers, this 
procedure defines a [0,1) sequence $\langle U_n\rangle$. The computation involves infinitely 
many auxiliary variables $C[a_1, \ldots, a_r]$, where r ≥ 1 and where $a_j$ = 0 or 1 for 
1 ≤ j ≤ r. These variables are initially all zero. 

W1. [Initialize n.] Set n ← 0. 

W2. [Initialize r.] Set r ← 1. 

W3. [Test $Z_r$.] If the element $U_n$ is to be in the subsequence defined by $Z_r$, based 
on the values of $U_k$ for 0 ≤ k < n, set $a_r$ ← 1; otherwise set $a_r$ ← 0. 

W4. [$B[a_1, \ldots, a_r]$ full?] If $C[a_1, \ldots, a_r] < 3 \cdot 4^{r-1}$, go to W6. 

W5. [Increase r.] Set r ← r + 1 and return to W3. 

W6. [Set $U_n$.] Increase the value of $C[a_1, \ldots, a_r]$ by 1 and let k be its new value. 
Set $U_n \leftarrow V_k$, where $V_k$ is defined in Lemma T above. 

W7. [Advance n.] Increase n by 1 and return to W2. ∎

Strictly speaking, this is not an algorithm, since it doesn't terminate; it 
would of course be easy to modify the procedure to stop when n reaches a given 
value. The reader will find it easier to grasp the idea of the construction by 
trying it out manually, replacing the number $3 \cdot 4^{r-1}$ of step W4 by $2^r$ during 
this experiment. 
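The manual experiment suggested here can also be run as a program. The sketch below follows steps W1 through W7 but stops after a given number of terms; everything about its interface is an assumption made for illustration: the function name `wald_sequence`, the convention that `rule(r, prefix)` answers whether rule number r selects the next element, the `capacity` parameter standing in for the threshold of step W4, and the toy rule family `demo_rule`.

```python
from collections import defaultdict
from fractions import Fraction

def V(k):
    """V_k of Lemma T: binary digits of k reversed about the radix point."""
    value, scale = Fraction(0), Fraction(1, 2)
    while k > 0:
        if k & 1:
            value += scale
        k >>= 1
        scale /= 2
    return value

def wald_sequence(rule, length, capacity=lambda r: 3 * 4 ** (r - 1)):
    """Generate U_0 .. U_{length-1} following steps W1-W7.

    rule(r, prefix) -> bool is a hypothetical interface: True when rule
    number r would put the next element into its subsequence, given the
    elements generated so far.
    """
    C = defaultdict(int)           # counters C[a_1, ..., a_r], initially zero
    U = []
    for _ in range(length):        # W1 and W7: n = 0, 1, 2, ...
        a = []                     # W2: start with r = 1
        while True:
            a.append(1 if rule(len(a) + 1, U) else 0)   # W3
            key = tuple(a)
            if C[key] < capacity(len(a)):               # W4: room left?
                break
            # W5: otherwise loop again with r increased by one
        C[key] += 1                # W6: k is the new counter value
        U.append(V(C[key]))
    return U

def demo_rule(r, prefix):
    """Toy rules: rule 1 selects a term when there is no previous term or
    the previous term is < 1/2; every other rule selects all terms."""
    if r == 1:
        return len(prefix) == 0 or prefix[-1] < Fraction(1, 2)
    return True

# Hand-sized experiment: capacity 2**r in place of 3 * 4**(r-1).
demo = wald_sequence(demo_rule, 8, capacity=lambda r: 2 ** r)
print([str(u) for u in demo])
```

A dictionary keyed by the tuple $(a_1, \ldots, a_r)$ stands in for the infinitely many counters, since only finitely many are touched in any finite run. Note that $V_0$ is never produced, because the counter is increased before it is used as an index.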

Algorithm W is not meant to be a practical source of random numbers. It 
is intended to serve only a theoretical purpose: 

Theorem W. Let $\langle U_n\rangle$ be the sequence of rational numbers defined by Algorithm 
W, and let k be a positive integer. If the subsequence $\langle U_n\rangle Z_k$ is infinite, it is 
1-distributed. 

Proof. Let $A[a_1, \ldots, a_r]$ denote the (possibly empty) subsequence of $\langle U_n\rangle$ 
containing precisely those elements $U_n$ that, for all j ≤ r, belong to subsequence 
$\langle U_n\rangle Z_j$ if $a_j = 1$ and do not belong to subsequence $\langle U_n\rangle Z_j$ if $a_j = 0$. 

It suffices to prove, for all r ≥ 1 and all pairs of binary numbers $a_1 \ldots a_r$ 
and $b_1 \ldots b_r$, that $\Pr(U_n \in I_{b_1 \ldots b_r}) = 2^{-r}$ with respect to the subsequence 
$A[a_1, \ldots, a_r]$, whenever the latter is infinite. (See Eq. (30).) For if r ≥ k, 
the infinite sequence $\langle U_n\rangle Z_k$ is the finite union of the disjoint subsequences 
$A[a_1, \ldots, a_r]$ for $a_k = 1$ and $a_j$ = 0 or 1 for 1 ≤ j ≤ r, j ≠ k; and it follows 
that $\Pr(U_n \in I_{b_1 \ldots b_r}) = 2^{-r}$ with respect to $\langle U_n\rangle Z_k$. (See exercise 33.) This is 
enough to show that the sequence is 1-distributed, by Theorem A. 

Let $B[a_1, \ldots, a_r]$ denote the subsequence of $\langle U_n\rangle$ that consists of the values 
for those n in which $C[a_1, \ldots, a_r]$ is increased by one in step W6 of the 
algorithm. By the algorithm, $B[a_1, \ldots, a_r]$ is a finite sequence with at most $3 \cdot 4^{r-1}$ 
elements. All but a finite number of the members of $A[a_1, \ldots, a_r]$ come from the 
subsequences $B[a_1, \ldots, a_r, \ldots, a_t]$, where $a_j$ = 0 or 1 for r < j ≤ t. 

Now assume that $A[a_1, \ldots, a_r]$ is infinite, and let $A[a_1, \ldots, a_r] = \langle U_{s_n}\rangle$, 
where $s_0 < s_1 < s_2 < \cdots$. If N is a large integer, with $4^r \le 4^q < N \le 4^{q+1}$, 
it follows that the number of values of k < N for which $U_{s_k}$ is in $I_{b_1 \ldots b_r}$ is 
(except for finitely many elements at the beginning of the subsequence) 

$$\nu(N) = \nu(N_1) + \cdots + \nu(N_m).$$

Here m is the number of subsequences $B[a_1, \ldots, a_t]$ listed above in which $U_{s_k}$ 
appears for some k < N; $N_j$ is the number of values of k with $U_{s_k}$ in the 
corresponding subsequence; and $\nu(N_j)$ is the number of such values that are also 
in $I_{b_1 \ldots b_r}$. Therefore by Lemma T, 

$$|\nu(N) - 2^{-r}N| = |\nu(N_1) - 2^{-r}N_1 + \cdots + \nu(N_m) - 2^{-r}N_m|$$
$$\le |\nu(N_1) - 2^{-r}N_1| + \cdots + |\nu(N_m) - 2^{-r}N_m|$$
$$\le m \le 1 + 2 + 4 + \cdots + 2^{q-r+1} < 2^{q+1}.$$

The inequality on m follows here from the fact that, by our choice of N, $U_{s_N}$ is 
in $B[a_1, \ldots, a_t]$ for some t ≤ q + 1. 

We have proved that 

$$|\nu(N)/N - 2^{-r}| \le 2^{q+1}/N < 2/\sqrt{N}. \quad ∎$$

To show finally that sequences satisfying Definition R5 exist, we note first 
that if $\langle U_n\rangle$ is a [0,1) sequence of rational numbers and if Z is a computable 
subsequence rule for a b-ary sequence, we can make Z into a computable subsequence 
rule Z' for $\langle U_n\rangle$ by letting $f'_n(x_1, \ldots, x_n)$ in Z' equal $f_n(\lfloor bx_1\rfloor, \ldots, \lfloor bx_n\rfloor)$ in Z. If 
the [0,1) sequence $\langle U_n\rangle Z'$ is equidistributed, so is the b-ary sequence $\langle\lfloor bU_n\rfloor\rangle Z$. 
Now the set of all computable subsequence rules for b-ary sequences, for all values 
of b, is countable (since only countably many effective algorithms are possible), so 
they may be listed in some sequence $Z_1, Z_2, \ldots$; therefore Algorithm W defines 
a [0,1) sequence that is random in the sense of Definition R5. 

This brings us to a somewhat paradoxical situation. As we mentioned earlier, 
no effective algorithm can define a sequence that satisfies Definition R4, and for 
the same reason there is no effective algorithm that defines a sequence satisfying 
Definition R5. A proof of the existence of such random sequences is necessarily 
nonconstructive; how then can Algorithm W construct such a sequence? 

There is no contradiction here; we have merely stumbled on the fact that
the set of all effective algorithms cannot be enumerated by an effective
algorithm. In other words, there is no effective algorithm to select the $j$th
computable subsequence rule $\mathcal{R}_j$; this happens because there is no effective algorithm to
determine if a computational method ever terminates. (We shall return to this
topic in Chapter 11.) Important large classes of algorithms can be
systematically enumerated; thus, for example, Algorithm W shows that it is possible to
construct, with an effective algorithm, a sequence that satisfies Definition R5 if
we restrict consideration to subsequence rules that are “primitive recursive.”

By modifying step W6 of Algorithm W, so that it sets $U_n \leftarrow V_{k+t}$ instead
of $V_k$, where $t$ is any nonnegative integer depending on $a_1$, ..., $a_r$, we can show
that there are uncountably many [0,1) sequences satisfying Definition R5.

The following theorem shows still another way to prove the existence of 
uncountably many random sequences, using a less direct argument based on 
measure theory, even if the strong definition R6 is used: 

Theorem M. Let the real number $x$, $0 \le x < 1$, correspond to the binary
sequence $\langle X_n\rangle$ if the binary representation of $x$ is $(0.X_0X_1\ldots)_2$. Under this
correspondence, almost all $x$ correspond to binary sequences that are random in
the sense of Definition R6. (In other words, the set of all real $x$ that correspond
to a binary sequence that is nonrandom by Definition R6 has measure zero.)

Proof. Let $S$ be an effective algorithm that determines an infinite sequence of
distinct nonnegative integers $\langle s_n\rangle$, where the choice of $s_n$ depends only on $n$ and
$X_{s_k}$ for $0 \le k < n$; and let $\mathcal{R}$ be a computable subsequence rule. Then any
binary sequence $\langle X_n\rangle$ leads to a subsequence $\langle X_{s_n}\rangle\mathcal{R}$, and Definition R6 says
this subsequence must either be finite or 1-distributed. It suffices to prove that
for fixed $\mathcal{R}$ and $S$ the set $N(\mathcal{R},S)$ of all real $x$ corresponding to $\langle X_n\rangle$, such
that $\langle X_{s_n}\rangle\mathcal{R}$ is infinite and not 1-distributed, has measure zero. For $x$ has a
nonrandom binary representation if and only if $x$ is in $\bigcup N(\mathcal{R},S)$, taken over the
countably many choices of $\mathcal{R}$ and $S$.

Therefore let $\mathcal{R}$ and $S$ be fixed. Consider the set $T(a_1a_2\ldots a_r)$ defined for
all binary numbers $a_1a_2\ldots a_r$ as the set of all $x$ corresponding to $\langle X_n\rangle$, such that
$\langle X_{s_n}\rangle\mathcal{R}$ has $\ge r$ elements whose first $r$ elements are respectively equal to $a_1$, $a_2$,
..., $a_r$. Our first result is that

$T(a_1a_2\ldots a_r)$ has measure $\le 2^{-r}$. (32)

To prove this, we start by observing that $T(a_1a_2\ldots a_r)$ is a measurable set: Each
element of $T(a_1a_2\ldots a_r)$ is a real number $x = (0.X_0X_1\ldots)_2$ for which there
exists an integer $m$ such that algorithm $S$ determines distinct values $s_0$, $s_1$, ...,
$s_m$, and rule $\mathcal{R}$ determines a subsequence of $X_{s_0}$, $X_{s_1}$, ..., $X_{s_m}$ such that $X_{s_m}$
is the $r$th element of this subsequence. The set of all real $y = (0.Y_0Y_1\ldots)_2$
such that $Y_{s_k} = X_{s_k}$ for $0 \le k \le m$ also belongs to $T(a_1a_2\ldots a_r)$, and
this is a measurable set consisting of the finite union of dyadic subintervals
$I_{b_1 \ldots b_t}$. Since there are only countably many such dyadic intervals, we see
that $T(a_1a_2\ldots a_r)$ is a countable union of dyadic intervals, and it is therefore
measurable. Furthermore, this argument can be extended to show that the
measure of $T(a_1\ldots a_{r-1}0)$ equals the measure of $T(a_1\ldots a_{r-1}1)$, since the latter
is a union of dyadic intervals obtained from the former by requiring that $Y_{s_k} =
X_{s_k}$ for $0 \le k < m$ and $Y_{s_m} \ne X_{s_m}$. Now since

$$T(a_1\ldots a_{r-1}0) \cup T(a_1\ldots a_{r-1}1) \subseteq T(a_1\ldots a_{r-1}),$$

the measure of $T(a_1a_2\ldots a_r)$ is at most one-half the measure of $T(a_1\ldots a_{r-1})$.
The inequality (32) follows by induction on $r$.

Now that (32) has been established, the remainder of the proof is essentially
to show that the binary representations of almost all real numbers are
equidistributed. The next few paragraphs constitute a rather long but not difficult
proof of this fact, and they serve to provide probability estimates that are useful
in many other problems.

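For $0 < \epsilon < 1$, let $B(r,\epsilon)$ be $\bigcup T(a_1\ldots a_r)$, where the union is taken over
all binary numbers $a_1\ldots a_r$ for which the number $\nu(r)$ of zeros among $a_1\ldots a_r$
satisfies

$$|\nu(r) - \tfrac12 r| \ge 1 + \epsilon r.$$

The number of such binary numbers is $C(r,\epsilon) = \sum \binom{r}{k}$ summed over the values
of $k$ with $|k - \frac12 r| \ge 1 + \epsilon r$. A suitable estimate for the tail of the binomial
distribution can be obtained by the following standard trick: Let $x$ and $p$ be any
positive numbers less than 1, let $q = 1 - p$, and let $s = (p + \epsilon)r$. Then, since
$x^{s-k} \ge 1$ whenever $k \ge s$,

$$\sum_{k \ge s} \binom{r}{k} p^k q^{r-k} \le \sum_{k \ge 0} \binom{r}{k} p^k q^{r-k} x^{s-k} = x^s \Bigl(q + \frac{p}{x}\Bigr)^{r}.$$

By elementary calculus, the minimum value of $x^s(q + p/x)^r$ occurs when $x =
(p/(p+\epsilon))/(q/(q-\epsilon))$, and this value of $x$ yields

$$x^s \Bigl(q + \frac{p}{x}\Bigr)^{r} = \Bigl(\Bigl(\frac{p}{p+\epsilon}\Bigr)^{p+\epsilon} \Bigl(\frac{q}{q-\epsilon}\Bigr)^{q-\epsilon}\Bigr)^{r}.$$

Now when $\epsilon < \min(p,q)$ we have

$$\begin{aligned}
\ln\Bigl(\frac{p}{p+\epsilon}\Bigr)^{p+\epsilon} &= -\epsilon - \frac{\epsilon^2}{2p} + \frac{\epsilon^3}{6p^2} - \frac{\epsilon^4}{12p^3} + \cdots < -\epsilon, \\
\ln\Bigl(\frac{q}{q-\epsilon}\Bigr)^{q-\epsilon} &= +\epsilon - \frac{\epsilon^2}{2q} - \frac{\epsilon^3}{6q^2} - \frac{\epsilon^4}{12q^3} - \cdots < \epsilon - \frac{\epsilon^2}{2q},
\end{aligned}$$

so we have the following bound for all $r > 0$ and $0 < \epsilon < \min(p,q)$:

$$\sum_{k \ge (p+\epsilon)r} \binom{r}{k} p^k q^{r-k} < e^{-\epsilon^2 r/(2q)}.$$ (33)

But $C(r,\epsilon)$ is at most $2^{r+1}$ times this left-hand quantity, in the special case $p = q = \frac12$;
hence by (32)

$B(r,\epsilon)$ has measure $\le 2^{-r}C(r,\epsilon) \le 2e^{-\epsilon^2 r}$.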
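The tail estimate (33) is easy to check numerically. The following Python sketch is an illustration only; the function names are hypothetical and not part of the text. It evaluates the left-hand sum exactly with rational arithmetic and compares it with the exponential bound, assuming $0 < \epsilon < \min(p,q)$ as in the derivation.

```python
from fractions import Fraction
from math import ceil, comb, exp


def binomial_upper_tail(r, p, eps):
    """Sum of C(r,k) p^k q^(r-k) over all k >= (p + eps) r, where q = 1 - p.

    p and eps should be Fractions so that the cutoff (p + eps) r is exact.
    """
    q = 1 - p
    k_min = ceil((p + eps) * r)
    total = sum(comb(r, k) * p**k * q**(r - k) for k in range(k_min, r + 1))
    return float(total)


def bound_33(r, p, eps):
    """Right-hand side of (33): exp(-eps^2 r / (2q))."""
    q = 1 - p
    return exp(-float(eps * eps * r / (2 * q)))


if __name__ == "__main__":
    p = Fraction(1, 2)
    for r in (10, 100, 400):
        for eps in (Fraction(1, 20), Fraction(1, 10), Fraction(1, 4)):
            tail = binomial_upper_tail(r, p, eps)
            print(f"r={r:4d}  eps={float(eps):.2f}  "
                  f"tail={tail:.3e}  bound={bound_33(r, p, eps):.3e}")
```

In such experiments the true tail is usually far smaller than the bound; the bound is what the proof needs, because it makes the series of measures converge.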

The next step is to define

$$B^*(r,\epsilon) = B(r,\epsilon) \cup B(r+1,\epsilon) \cup B(r+2,\epsilon) \cup \cdots.$$

The measure of $B^*(r,\epsilon)$ is at most $\sum_{k \ge r} 2e^{-\epsilon^2 k}$, and this is the remainder of a
convergent series, so

$$\lim_{r\to\infty} (\text{measure of } B^*(r,\epsilon)) = 0.$$ (34)

Now if $x$ is a real number whose binary expansion $(0.X_0X_1\ldots)_2$ leads to an
infinite sequence $\langle X_{s_n}\rangle\mathcal{R}$ that is not 1-distributed, and if $\nu(r)$ denotes the number
of zeros in the first $r$ elements of the latter sequence, then

$$|\nu(r)/r - \tfrac12| \ge 2\epsilon,$$

for some $\epsilon > 0$ and infinitely many $r$. This means $x$ is in $B^*(r,\epsilon)$ for all $r$. So
finally we find that

$$N(\mathcal{R},S) = \bigcup_{t \ge 2}\, \bigcap_{r \ge 1} B^*(r, 1/t);$$

and, by (34), $\bigcap_{r \ge 1} B^*(r,1/t)$ has measure zero for all $t$. Hence $N(\mathcal{R},S)$ has
measure zero. ∎

From the existence of binary sequences satisfying Definition R6, we can show 
the existence of [0,1) sequences that are random in this sense. For details, see 
exercise 36. The consistency of Definition R6 is thereby established. 

E. Random finite sequences. An argument was given above to indicate that it 
is impossible to define the concept of randomness for finite sequences; any given 
finite sequence is as likely as any other. Still, nearly everyone would agree that 
the sequence 011101001 is “more random” than 101010101, and even the latter
sequence is “more random” than 000000000. Although it is true that truly
random sequences will exhibit locally nonrandom behavior, we would expect
such behavior only in a long finite sequence, not in a short one.

Several ways for defining the randomness of a finite sequence have been
proposed, and only a few of the ideas will be sketched here. Let us consider only
the case of b-ary sequences.

Given a b-ary sequence $X_1$, $X_2$, ..., $X_N$, we can say that

$\Pr(S(n)) \approx p$, if $|\nu(N)/N - p| \le 1/\sqrt{N}$,

where $\nu(n)$ is the quantity appearing in Definition A at the beginning of this
section. The above sequence can be called “k-distributed” if

$$\Pr(X_n X_{n+1} \ldots X_{n+k-1} = x_1 x_2 \ldots x_k) \approx 1/b^k$$

for all b-ary numbers $x_1 x_2 \ldots x_k$. (Cf. Definition D; unfortunately a sequence
may be k-distributed by this new definition when it is not (k − 1)-distributed.)

A definition of randomness may now be given analogous to Definition R1,
as follows:

Definition Q1. A b-ary sequence of length N is “random” if it is k-distributed
(in the above sense) for all positive integers $k \le \log_b N$.

According to this definition, for example, there are 178 nonrandom binary
sequences of length 11:

00000001111 10000000111 11000000011 11100000001 11110000000
00000001110 10000000110 11000000010 11100000000 11010000000
00000001101 10000000101 11000000001 10100000001 10110000000
00000001011 10000000011 01000000011 01100000001 01110000000
00000000111

plus 01010101010 and all sequences with nine or more zeros, plus all sequences
obtained from the preceding sequences by interchanging ones and zeros. (The
definition is symmetric under left-right reversal, so the table must contain the
reversal of each of its entries.)
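A tally of this kind can be cross-checked by brute force. The Python sketch below is one possible reading of Definition Q1, not a definitive implementation: it assumes that only the $N - k + 1$ windows lying entirely inside the sequence are counted, and that the count is divided by $N$ itself. The function names are hypothetical.

```python
from itertools import product


def approx_prob_ok(count, n_total, b, k):
    """True if |count/N - 1/b^k| <= 1/sqrt(N), tested with integers only.

    Multiplying through by N b^k and squaring gives the equivalent
    condition (count * b^k - N)^2 <= N * b^(2k).
    """
    return (count * b**k - n_total) ** 2 <= n_total * b ** (2 * k)


def is_k_distributed(seq, b, k):
    """k-distribution of a finite sequence in the sense preceding Definition Q1."""
    n_total = len(seq)
    counts = {}
    for start in range(n_total - k + 1):      # assumption: windows must fit entirely
        window = tuple(seq[start:start + k])
        counts[window] = counts.get(window, 0) + 1
    return all(approx_prob_ok(counts.get(w, 0), n_total, b, k)
               for w in product(range(b), repeat=k))


def is_random_q1(seq, b=2):
    """Definition Q1: k-distributed for every positive integer k <= log_b N."""
    k = 1
    while b**k <= len(seq):
        if not is_k_distributed(seq, b, k):
            return False
        k += 1
    return True


if __name__ == "__main__":
    nonrandom = [s for s in product((0, 1), repeat=11) if not is_random_q1(s)]
    print(len(nonrandom), "nonrandom binary sequences of length 11")
```

Under these assumptions a sequence of length 11 fails when some digit occurs nine or more times, some pair occurs seven or more times, or some triple occurs five or more times.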

Similarly, we can formulate a definition for finite sequences analogous to
Definition R6. Let A be a set of algorithms, each of which is a combination
selection and choice procedure that gives a subsequence $\langle X_{s_n}\rangle\mathcal{R}$, as in the proof
of Theorem M.

Definition Q2. The b-ary sequence $X_1$, $X_2$, ..., $X_N$ is $(n,\epsilon)$-random with respect
to a set of algorithms A, if for every subsequence $X_{t_1}$, $X_{t_2}$, ..., $X_{t_m}$ determined
by an algorithm of A we have either $m < n$ or

$$\Bigl|\frac{1}{m}\,\nu_a(X_{t_1},\ldots,X_{t_m}) - \frac{1}{b}\Bigr| \le \epsilon \qquad\text{for } 0 \le a < b.$$

Here $\nu_a(x_1,\ldots,x_m)$ is the number of $a$'s in the sequence $x_1$, ..., $x_m$.

(In other words, every sufficiently long subsequence determined by an
algorithm of A must be approximately equidistributed.) The basic idea in this case
is to let A be a set of “simple” algorithms; the number (and the complexity) of
the algorithms in A can grow as N grows.

As an example of Definition Q2, let us consider binary sequences, and let A 
be just the following four algorithms: 

a) Take the whole sequence. 

b) Take alternate terms of the sequence, starting with the first. 

c) Take the terms of the sequence following a zero. 

d) Take the terms of the sequence following a one. 

Now a sequence $X_1$, ..., $X_8$ is $(4,\frac18)$-random if:

by (a), $|\frac18(X_1 + \cdots + X_8) - \frac12| \le \frac18$, i.e., if there are 3, 4, or 5 ones;

by (b), $|\frac14(X_1 + X_3 + X_5 + X_7) - \frac12| \le \frac18$, i.e., if there are two ones in
odd-numbered positions;

by (c), there are three possibilities depending on how many zeros occupy
positions $X_1$, ..., $X_7$: if there are 2 or 3 zeros here, there is no condition
to test (since $n = 4$); if there are 4 zeros, they must respectively be
followed by two zeros and two ones; and if there are 5 zeros, they must
respectively be followed by two or three zeros;

by (d), we get conditions similar to those implied by (c).

It turns out that only the following binary sequences of length 8 are
$(4,\frac18)$-random with respect to these rules:

00001011 00101001 01001110 01101000
00011010 00101100 01011011 01101100
00011011 00110010 01011110 01101101
00100011 00110011 01100010 01110010
00100110 00110110 01100011 01110110
00100111 00111001 01100110

plus those obtained by interchanging 0 and 1 consistently.
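The four rules (a)-(d) are simple enough to test exhaustively. The following Python sketch, with hypothetical function names, applies Definition Q2 with $n = 4$ and $\epsilon = \frac18$ to all 256 binary sequences of length 8; exact rational arithmetic avoids any doubt about the boundary case $|\cdot| = \epsilon$. It is meant as a check on the table above, under the reading that rule (b) selects positions 1, 3, 5, 7 and that rules (c) and (d) select the immediate successors of zeros and ones.

```python
from fractions import Fraction
from itertools import product


def whole(seq):                       # rule (a): the whole sequence
    return list(seq)


def alternate(seq):                   # rule (b): 1st, 3rd, 5th, ... terms
    return list(seq[0::2])


def following(value):                 # rules (c), (d): terms that follow `value`
    def rule(seq):
        return [seq[i + 1] for i in range(len(seq) - 1) if seq[i] == value]
    return rule


RULES = [whole, alternate, following(0), following(1)]


def is_n_eps_random(seq, n, eps, rules=RULES, b=2):
    """Definition Q2 for the b-ary sequence `seq` and the given set of rules."""
    for rule in rules:
        sub = rule(seq)
        m = len(sub)
        if m < n:
            continue                  # subsequence too short: nothing to test
        for a in range(b):
            if abs(Fraction(sub.count(a), m) - Fraction(1, b)) > eps:
                return False
    return True


if __name__ == "__main__":
    good = ["".join(map(str, s)) for s in product((0, 1), repeat=8)
            if is_n_eps_random(s, 4, Fraction(1, 8))]
    for s in good:
        if s[0] == "0":               # the other half are the complements
            print(s)
```

Adding further rules to `RULES` can only shrink the set of accepted sequences, which illustrates why the size of A must be controlled.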

It is clear that we could make the set of algorithms so large that no sequences
satisfy the definition, when $n$ and $\epsilon$ are reasonably small. A. N. Kolmogorov has
proved that an $(n,\epsilon)$-random binary sequence will always exist, for any given N,
if the number of algorithms in A does not exceed

$$\tfrac12\, e^{2n\epsilon^2(1-\epsilon)}.$$ (35)

This result is not nearly strong enough to show that sequences satisfying
Definition Q1 will exist, but the latter can be constructed efficiently using the procedure
of Rees in exercise 3.2.2-21.

Still another interesting approach to a definition of randomness has been
taken by Per Martin-Löf [Information and Control 9 (1966), 602-619]. Given a
finite b-ary sequence $X_1$, ..., $X_N$, let $l(X_1,\ldots,X_N)$ be the length of the shortest
Turing machine program that generates this sequence. (For a definition of Turing
machines, see Chapter 11; alternatively, we could use certain classes of effective
algorithms, such as those defined in exercise 1.1-8.) Then $l(X_1,\ldots,X_N)$ is a
measure of the “patternlessness” of the sequence, and we may equate this idea
with randomness. The sequences of length N that maximize $l(X_1,\ldots,X_N)$ may
be called random. (From the standpoint of practical random number generation
by computer, this is, of course, the worst definition of “randomness” that can
be imagined!)
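The quantity $l(X_1,\ldots,X_N)$ cannot be computed in general, but a general-purpose compressor gives a crude, computable upper-bound proxy for it. The Python sketch below uses zlib purely as a stand-in; it is not a Turing machine program length, and the function name is hypothetical. It shows only the qualitative effect: highly patterned sequences have short descriptions, while typical sequences do not.

```python
import random
import zlib


def description_length_proxy(bits):
    """Length in bytes of the zlib-compressed text of a string of '0'/'1' digits.

    This is only an upper-bound proxy for the length of the shortest
    program that generates the sequence.
    """
    return len(zlib.compress(bits.encode("ascii"), 9))


if __name__ == "__main__":
    n = 4096
    rng = random.Random(2024)          # fixed seed, so the run is repeatable
    samples = {
        "all zeros": "0" * n,
        "alternating": "01" * (n // 2),
        "pseudorandom": "".join(rng.choice("01") for _ in range(n)),
    }
    for name, bits in samples.items():
        print(f"{name:12s} {description_length_proxy(bits):5d} bytes")
```

Note that the pseudorandom sample is itself produced by a short program, so its true value of $l$ is small; the compressor simply fails to find that program. This is one way to see why the definition is useless for practical random number generation.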

Essentially the same definition of randomness was independently given by G.
Chaitin at about the same time; see JACM 16 (1969), 145-159. It is interesting
to note that even though this definition makes no reference to equidistribution
properties as our other definitions have, Martin-Löf and Chaitin have proved that
random sequences of this type also have the expected equidistribution properties.
In fact, Martin-Löf has demonstrated that such sequences satisfy all computable
statistical tests for randomness, in an appropriate sense.

For further developments in the definition of random finite sequences, see 
A. K. Zvonkin and L. A. Levin, Uspekhi Mat. Nauk 25,6 (Nov. 1970), 85-127 
[English translation in Russian Math. Surveys 25,6 (Nov. 1970), 83-124]; L. A. 
Levin, Doklady Akad. Nauk SSSR 212 (1973), 548-550. 

F. Summary, history, and bibliography. We have defined several degrees of 
randomness that a sequence might possess. 

An infinite sequence that is ∞-distributed satisfies a great many useful
properties that are expected of random sequences, and there is a rich theory
concerning ∞-distributed sequences. (The exercises that follow develop several
important properties of ∞-distributed sequences that have not been mentioned
in the text.) Definition R1 is therefore an appropriate basis for theoretical studies
of randomness.

The concept of an ∞-distributed b-ary sequence was introduced in 1909 by
Émile Borel. He essentially defined the concept of an (m, k)-distributed sequence,
and showed that the b-ary representations of almost all real numbers are
(m, k)-distributed for all m and k. He called such numbers normal to base b. An
excellent discussion of this topic appears in his well-known book, Leçons sur la
théorie des fonctions (2nd ed., 1914), 182-216.

The notion of an ∞-distributed sequence of real numbers, also called a
completely equidistributed sequence, first appeared in a note by N. M. Korobov
in Doklady Akad. Nauk SSSR 62 (1948), 21-22. Korobov and several of his
colleagues developed the theory of such sequences quite extensively in a series of
papers during the 1950s. Completely equidistributed sequences were
independently studied by Joel N. Franklin, Math. Comp. 17 (1963), 28-59, in a paper
that is particularly noteworthy because it was inspired by the problem of
random number generation. The book Uniform Distribution of Sequences by L.
Kuipers and H. Niederreiter (New York: Wiley, 1974) is an extraordinarily
complete source of information about the rich mathematical literature concerning
k-distributed sequences of all kinds.

We have seen, however, that ∞-distributed sequences need not be
sufficiently haphazard to qualify completely as “random.” Three definitions, R4,
R5, and R6, were formulated above to provide the additional conditions; and
Definition R6, in particular, seems to be an appropriate way to define the concept
of an infinite random sequence. It is a precise, quantitative statement that may
well coincide with the intuitive idea of true randomness.

Historically, the development of these definitions was primarily influenced
by R. von Mises’ quest for a good definition of “probability.” In Math. Zeitschrift
5 (1919), 52-99, von Mises proposed a definition similar in spirit to Definition R5,
although stated too strongly (like our Definition R3) so that no sequences
satisfying the conditions could possibly exist. Many people noticed this discrepancy,
and A. H. Copeland [Amer. J. Math. 50 (1928), 535-552] suggested weakening
von Mises’ definition by substituting what he called “admissible numbers” (or
Bernoulli sequences). These are equivalent to ∞-distributed [0,1) sequences in
which all entries $U_n$ have been replaced by 1 if $U_n < p$ or by 0 if $U_n \ge p$,
for a given probability $p$. Thus Copeland was essentially suggesting a return to
Definition R1. Then Abraham Wald showed that it is not necessary to weaken
von Mises’ definition so drastically, and he proposed substituting a countable set
of subsequence rules. In an important paper [Ergebnisse eines math. Kolloquiums
8 (Vienna, 1937), 38-72], Wald essentially proved Theorem W, although he
made the erroneous assertion that the sequence constructed by Algorithm W
also satisfies the stronger condition that $\Pr(U_n \in A)$ = measure of $A$, for all
Lebesgue measurable $A \subseteq [0,1)$. We have observed that no sequence can satisfy
this property.

The concept of “computability” was still very much in its infancy when Wald
wrote his paper, and A. Church [Bull. Amer. Math. Soc. 47 (1940), 130-135]
showed how the precise notion of “effective algorithm” could be added to Wald’s
theory to make his definitions completely rigorous. The extension to Definition
R6 was due essentially to A. N. Kolmogorov [Sankhya (A) 25 (1963), 369-376],
who proposed Definition Q2 for finite sequences in that same paper. Another
definition of randomness for finite sequences, somewhere “between” Definitions
Q1 and Q2, had been formulated many years earlier by A. S. Besicovitch [Math.
Zeitschrift 39 (1934), 146-156].

The publications of Church and Kolmogorov considered only binary
sequences for which $\Pr(X_n = 1) = p$ for a given probability $p$. Our discussion
in this section has been slightly more general, since a [0,1) sequence essentially
represents all $p$ at once. The von Mises-Wald-Church definition has been refined
in yet another interesting way by J. V. Howard, Zeitschr. f. math. Logik und
Grundlagen d. Math. 21 (1975), 215-224.

Another important contribution was made by Donald W. Loveland [Zeitschr.
f. math. Logik und Grundlagen d. Math. 12 (1966), 279-294], who discussed
Definitions R4, R5, R6, and several intermediate concepts. Loveland proved
that there are R5-random sequences that do not satisfy R4, thereby establishing
the need for a stronger definition such as R6. In fact, he defined a rather simple
permutation $\langle f(n)\rangle$ of the nonnegative integers, and an Algorithm W' analogous
to Algorithm W, such that $\overline{\Pr}(U_{f(n)} \ge \frac12) - \underline{\Pr}(U_{f(n)} \ge \frac12) \ge \frac12$ for every
R5-random sequence $\langle U_n\rangle$ produced by Algorithm W' when it is given an infinite
set of subsequence rules $\mathcal{R}_k$.

Although Definition R6 is intuitively much stronger than R4, it is apparently 
not a simple matter to prove this rigorously, and for several years it was an open 
question whether or not R4 implies R6. Finally Thomas Herzog and James C. 
Owings, Jr., discovered how to construct a large family of sequences that satisfy 
R4 but not R6. [See Zeitschr. f. math. Logik und Grundlagen d. Math. 22 (1976), 
385-389.] 

Kolmogorov wrote another significant paper [Problemy Peredachi Informatsii
1 (1965), 3-11] in which he considered the problem of defining the “information
content” of a sequence, and this work led to Chaitin and Martin-Löf’s interesting
definition of finite random sequences via “patternlessness.” [See IEEE Trans.
IT-14 (1968), 662-664.]

For a philosophical discussion of random sequences, see K. R. Popper, The
Logic of Scientific Discovery (London, 1959), especially the interesting
construction on pp. 162-163, which he first published in 1934.

Further connections between random sequences and recursive function
theory have been explored by D. W. Loveland, Trans. Amer. Math. Soc. 125 (1966),
497-510. See also C.-P. Schnorr [Zeitschr. Wahr. verw. Geb. 14 (1969),
27-35], who found strong relations between random sequences and the “species
of measure zero” defined by L. E. J. Brouwer in 1919. Schnorr’s subsequent
book Zufälligkeit und Wahrscheinlichkeit [Lecture Notes in Math. 218 (Berlin:
Springer, 1971)] gives a detailed treatment of the entire subject of randomness
and makes an excellent introduction to the ever-growing advanced literature on
the topic.


EXERCISES 

1. [10] Can a periodic sequence be equidistributed?

2. [10] Consider the periodic binary sequence 0, 0, 1, 1, 0, 0, 1, 1, .... Is it
1-distributed? Is it 2-distributed? Is it 3-distributed?

3. [M22] Construct a periodic ternary sequence that is 3-distributed.

► 4. [HM22] Let $U_n = (2^{\lfloor \lg(n+1)\rfloor}/3) \bmod 1$. What is $\Pr(U_n < \frac12)$?

5. [HM14] Prove that Pr(S(n) and T(n)) + Pr(S(n) or T(n)) = Pr(S(n)) + Pr(T(n)),
for any two statements S(n), T(n), provided that at least three of the limits exist. For
example, if a sequence is 2-distributed, we would find that

$$\Pr(u_1 \le U_n < v_1 \text{ or } u_2 \le U_{n+1} < v_2) = v_1 - u_1 + v_2 - u_2 - (v_1 - u_1)(v_2 - u_2).$$

6. [HM23] Let $S_1(n)$, $S_2(n)$, ... be an infinite sequence of statements about mutually
disjoint events; i.e., $S_i(n)$ and $S_j(n)$ cannot simultaneously be true if $i \ne j$. Assume
that $\Pr(S_j(n))$ exists for each $j \ge 1$. Show that $\Pr(S_j(n)$ is true for some $j \ge 1) \ge
\sum_{j \ge 1} \Pr(S_j(n))$, and give an example to show that equality need not hold.

7. [HM21] Let $\{S_{ij}(n)\}$ be a family of statements such that $\Pr(S_{ij}(n))$ exists for all
$i, j \ge 1$. Assume that for all $n > 0$, $S_{ij}(n)$ is true for exactly one pair of integers $i, j$.
If $\sum_{i,j \ge 1} \Pr(S_{ij}(n)) = 1$, does it follow that “$\Pr(S_{ij}(n)$ is true for some $j \ge 1)$” exists
for all $i \ge 1$, and that it equals $\sum_{j \ge 1} \Pr(S_{ij}(n))$?

8. [ M15 ] Prove (13). 

9. [ HM20 ] Prove Lemma E. [Hint: Consider — a) 2 .] 

► 10. [HM22] Where was the fact that m divides q used in the proof of Theorem C?

11. [M10] Use Theorem C to prove that if $\langle U_n\rangle$ is ∞-distributed, so is the subsequence
$\langle U_{2n}\rangle$.

12. [HM20] Show that a k-distributed sequence passes the “maximum-of-k test,” in
the following sense: $\Pr(u \le \max(U_n, U_{n+1}, \ldots, U_{n+k-1}) < v) = v^k - u^k$.

► 13. [HM27] Show that an ∞-distributed [0,1) sequence passes the “gap test” in the
following sense: If $0 \le \alpha < \beta \le 1$ and $p = \beta - \alpha$, let $f(0) = 0$, and for $n \ge 1$ let
$f(n)$ be the smallest integer $m > f(n-1)$ such that $\alpha \le U_m < \beta$; then

$$\Pr(f(n) - f(n-1) = k) = p(1-p)^{k-1}.$$

14. [HM25] Show that an ∞-distributed sequence passes the “run test” in the
following sense: If $f(0) = 1$ and if, for $n \ge 1$, $f(n)$ is the smallest integer $m > f(n-1)$ such
that $U_{m-1} > U_m$, then

$$\Pr(f(n) - f(n-1) = k) = 2k/(k+1)! - 2(k+1)/(k+2)!.$$

► 15. [HM30] Show that an ∞-distributed sequence passes the “coupon-collector’s test”
when there are only two kinds of coupons, in the following sense: Let $X_1$, $X_2$, ... be an
∞-distributed binary sequence. Let $f(0) = 0$, and for $n \ge 1$ let $f(n)$ be the smallest
integer $m > f(n-1)$ such that $\{X_{f(n-1)+1},\ldots,X_m\}$ is the set $\{0,1\}$. Prove that
$\Pr(f(n) - f(n-1) = k) = 2^{1-k}$, for $k \ge 2$. (Cf. exercise 7.)

16. [HM38] Does the coupon-collector’s test hold for ∞-distributed sequences when
there are more than two kinds of coupons? (Cf. the previous exercise.)

17. [HM50] If r is any given rational number, Franklin has proved that the sequence
$\langle r^n \bmod 1\rangle$ is not 2-distributed. But is there any rational number r for which this
sequence is equidistributed? In particular, is the sequence equidistributed when $r = \frac32$?
[Cf. K. Mahler, Mathematika 4 (1957), 122-124.]

► 18. [HM22] Prove that if $U_0$, $U_1$, ... is k-distributed, so is the sequence $V_0$, $V_1$, ...,
where $V_n = \lfloor nU_n\rfloor/n$.

19. [HM35] Consider a modification of Definition R4 that requires the subsequences
to be only 1-distributed instead of ∞-distributed. Is there a sequence that satisfies
this weaker definition, but that is not ∞-distributed? (Is the weaker definition really
weaker?)

20. [HM47] Does the sequence $\langle \theta^n \bmod 1\rangle$ satisfy Definition R4 for almost all real
numbers $\theta > 1$?

21. [HM20] Let S be a set and let M be a collection of subsets of S. Suppose that p
is a real-valued function of the sets in M, such that p(M) denotes the probability that
a “randomly” chosen element of S lies in M. Generalize Definitions B and D to obtain
a good definition of the concept of a k-distributed sequence $\langle Z_n\rangle$ of elements of S with
respect to the probability distribution p.

► 22. [HM30] (Hermann Weyl.) Show that the [0,1) sequence $\langle U_n\rangle$ is k-distributed if
and only if

$$\lim_{N\to\infty} \frac{1}{N} \sum_{0 \le n < N} \exp\bigl(2\pi i (c_1 U_n + \cdots + c_k U_{n+k-1})\bigr) = 0$$

for every set of integers $c_1$, $c_2$, ..., $c_k$ not all zero.

23. [M34] Show that a b-ary sequence $\langle X_n\rangle$ is k-distributed if and only if all of the
sequences $\langle c_1 X_n + c_2 X_{n+1} + \cdots + c_k X_{n+k-1}\rangle$ are equidistributed, whenever $c_1$, $c_2$,
..., $c_k$ are integers with $\gcd(c_1,\ldots,c_k) = 1$.

24. [M25] Show that a [0,1) sequence $\langle U_n\rangle$ is k-distributed if and only if all of the
sequences $\langle c_1 U_n + c_2 U_{n+1} + \cdots + c_k U_{n+k-1}\rangle$ are equidistributed, whenever $c_1$, $c_2$,
..., $c_k$ are integers not all zero.

25. [HM20] A sequence is called a “white sequence” if all serial correlations are zero,
i.e., if the equation in Corollary S is true for all $k \ge 1$. (By Corollary S, an
∞-distributed sequence is white.) Show that if a [0,1) sequence is equidistributed, it is
white if and only if

$$\lim_{n\to\infty} \frac{1}{n} \sum_{0 \le j < n} (U_j - \tfrac12)(U_{j+k} - \tfrac12) = 0, \qquad\text{for all } k \ge 1.$$

26. [HM34] (J. Franklin.) A white sequence, as defined in the previous exercise, can
definitely fail to be random. Let $U_0$, $U_1$, ... be an ∞-distributed sequence, and define
the sequence $V_0$, $V_1$, ... as follows:

$(V_{2n-1}, V_{2n}) = (U_{2n-1}, U_{2n})$ if $(U_{2n-1}, U_{2n}) \in G$,

$(V_{2n-1}, V_{2n}) = (U_{2n}, U_{2n-1})$ if $(U_{2n-1}, U_{2n}) \notin G$,

where G is the set $\{(x,y) \mid x - \frac12 \le y < x \text{ or } x + \frac12 \le y\}$. Show that (a) $V_0$, $V_1$, ...
is equidistributed and white; (b) $\Pr(V_n > V_{n+1}) = \frac58$. (This points out the weakness
of the serial correlation test.)

27. [HM48] What is the highest possible value for $\Pr(V_n > V_{n+1})$ in an equidistributed,
white sequence? (D. Coppersmith has constructed such a sequence achieving the
value |.) 

► 28. [HM21] Use the sequence (11) to construct a [0,1) sequence that is 3-distributed, 
for which Pr(t/ 2 n > j) = f. 

29. [HM34] Let $X_0$, $X_1$, ... be a (2k)-distributed binary sequence. Show that

$$\overline{\Pr}(X_{2n} = 0) \le \frac12 + \binom{2k-1}{k}\Big/2^{2k}.$$

► 30. [M39] Construct a binary sequence that is (2k)-distributed, and for which

$$\Pr(X_{2n} = 0) = \frac12 + \binom{2k-1}{k}\Big/2^{2k}.$$

(Therefore the inequality in the previous exercise is the best possible.)

31. [M30] Show that [0,1) sequences exist that satisfy Definition R5, yet $\nu_n/n \ge \frac12$
for all $n > 0$, where $\nu_n$ is the number of $j < n$ for which $U_j < \frac12$. (This might be
considered a nonrandom property of the sequence.)

32. [M24] Given that $\langle X_n\rangle$ is a “random” b-ary sequence according to Definition R5,
and that $\mathcal{R}$ is a computable subsequence rule that specifies an infinite subsequence
$\langle X_n\rangle\mathcal{R}$, show that the latter subsequence is not only 1-distributed, it is “random” by
Definition R5.

33. [HM22] Let $\langle U_{r_n}\rangle$ and $\langle U_{s_n}\rangle$ be infinite disjoint subsequences of a sequence $\langle U_n\rangle$.
(Thus, $r_0 < r_1 < r_2 < \cdots$ and $s_0 < s_1 < s_2 < \cdots$ are increasing sequences of
integers and $r_m \ne s_n$ for any $m$, $n$.) Let $\langle U_{t_n}\rangle$ be the combined subsequence, so that
$t_0 < t_1 < t_2 < \cdots$ and the set $\{t_n\} = \{r_n\} \cup \{s_n\}$. Show that if $\Pr(U_{r_n} \in A) =
\Pr(U_{s_n} \in A) = p$, then $\Pr(U_{t_n} \in A) = p$.

► 34. [M25] Define subsequence rules $\mathcal{R}_1$, $\mathcal{R}_2$, $\mathcal{R}_3$, ... such that Algorithm W can be
used with these rules to give an effective algorithm to construct a [0,1) sequence
satisfying Definition R1.

► 35. [HM35] (D. W. Loveland.) Show that if a binary sequence $\langle X_n\rangle$ is R5-random,
and if $\langle s_n\rangle$ is any computable sequence as in Definition R4, then $\overline{\Pr}(X_{s_n} = 1) \ge \frac12$ and
$\underline{\Pr}(X_{s_n} = 1) \le \frac12$.

36. [HM30] Let $\langle X_n\rangle$ be a binary sequence that is “random” according to Definition
R6. Show that the [0,1) sequence $\langle U_n\rangle$ defined in binary notation by the scheme

$U_0 = (0.X_0)_2$
$U_1 = (0.X_1X_2)_2$
$U_2 = (0.X_3X_4X_5)_2$
$U_3 = (0.X_6X_7X_8X_9)_2$
...

is random in the sense of Definition R6.

37. [M37] (D. Coppersmith.) Define a sequence that satisfies Definition R4 but not
Definition R5. [Hint: Consider changing $U_0$, $U_1$, $U_4$, $U_9$, ... in a truly random sequence.]

38. [M49] (A. N. Kolmogorov.) Given N, n and $\epsilon$, what is the smallest number of
algorithms in a set A such that no $(n,\epsilon)$-random binary sequences of length N exist
with respect to A? (If exact formulas cannot be given, can asymptotic formulas be
found? The point of this problem is to discover how close the bound (35) comes to
being “best possible.”)

39. [HM45] (W. M. Schmidt.) Let $\langle U_n\rangle$ be a [0,1) sequence, and let $\nu_n(u)$ be the
number of nonnegative integers $j < n$ such that $0 \le U_j < u$. Prove that there is a
positive constant c such that, for any N and for any [0,1) sequence $\langle U_n\rangle$, we have

$$|\nu_n(u) - un| > c \ln N$$

for some $n$ and $u$ with $0 \le n < N$, $0 \le u < 1$. (In other words, no [0,1) sequence
can be too equidistributed.)

► 40. [16] (I. J. Good.) Can a valid table of random digits contain just one misprint?


3.6. SUMMARY

We have covered a fairly large number of topics in this chapter: how to
generate random numbers, how to test them, how to modify them in applications,
and how to derive theoretical facts about them. Perhaps the main question in
many readers’ minds will be, “What is the result of all this theory? What is a
simple, virtuous generator I can use in my programs in order to have a reliable
source of random numbers?”

The detailed investigations in this chapter suggest that the following
procedure gives the “nicest” and “simplest” random number generator for the machine
language of most computers: At the beginning of the program, set an integer
variable X to some value $X_0$. This variable X is to be used only for the purpose
of random number generation. Whenever a new random number is required by
the program, set

$$X \leftarrow (aX + c) \bmod m$$ (1)

and use the new value of X as the random value. It is necessary to choose $X_0$,
a, c, and m properly, and to use the random numbers wisely, according to the
following principles:

i) The “seed” number $X_0$ may be chosen arbitrarily. If the program is run
several times and a different source of random numbers is desired each time,
set $X_0$ to the last value attained by X on the preceding run; or (if more
convenient) set $X_0$ to the current date and time. If the program may need
to be rerun later with the same random numbers (e.g., when debugging), be
sure to print out $X_0$ if it isn’t otherwise known.

ii) The number m should be large, say at least $2^{30}$. It may conveniently be
taken as the computer’s word size, since this makes the computation of
(aX + c) mod m quite efficient. Section 3.2.1.1 discusses the choice of m in
more detail. The computation of (aX + c) mod m must be done exactly,
with no roundoff error.

iii) If m is a power of 2 (i.e., if a binary computer is being used), pick a so that
a mod 8 = 5. If m is a power of 10 (i.e., if a decimal computer is being used),
choose a so that a mod 200 = 21. This choice of a together with the choice
of c given below ensures that the random number generator will produce
all m different possible values of X before it starts to repeat (see Section
3.2.1.2) and ensures high “potency” (see Section 3.2.1.3).

iv) The multiplier a should preferably be chosen between .01m and .99m, and 
its binary or decimal digits should not have a simple, regular pattern. By 
choosing some haphazard constant like a = 3141592621 (which satisfies 
both of the conditions in (iii)), one almost always obtains a reasonably good 
multiplier. Further testing should of course be done if the random number 
generator is to be used extensively; for example, there should be no large 
quotients when Euclid’s algorithm is used to find the gcd of a and m (see 
Section 3.3.3). The multiplier should pass the spectral test (Section 3.3.4) 
and several tests of Section 3.3.2, before it is considered to have a truly clean
bill of health. 

v) The value of c is immaterial when a is a good multiplier, except that c must 
have no factor in common with m. Thus we may choose c = 1 or c = a. 

vi) The least significant (right-hand) digits of X are not very random, so decisions
based on the number X should always be influenced primarily by the
most significant digits. It is generally best to think of X as a random fraction
X/m between 0 and 1, that is, to visualize X with a decimal point at
its left, rather than to regard X as a random integer between 0 and m - 1.
To compute a random integer between 0 and k - 1, one should multiply
by k and truncate the result (see the beginning of Section 3.4.2).

vii) An important limitation on the randomness of sequence (1) is discussed in 
Section 3.3.4, where it is shown that the “accuracy” in t dimensions will
be only about one part in m^(1/t). Monte Carlo applications requiring higher
resolution can improve the randomness by employing techniques discussed 
in Section 3.2.2. 
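
Rules (i) through (vi) can be sketched in a few lines of a higher-level language. The following Python fragment is only an illustration under stated assumptions: the class name LinearCongruential and its method names are hypothetical, and the defaults m = 2^32, a = 3141592621, c = 1 are merely one choice consistent with the rules above, not a multiplier that has been put through the tests of Sections 3.3.2 and 3.3.4. Python integers are exact, so the requirement of rule (ii) that no roundoff occur is met automatically.

```python
class LinearCongruential:
    """Sketch of X <- (aX + c) mod m; names and defaults are illustrative."""

    def __init__(self, seed, a=3141592621, c=1, m=2**32):
        self.a, self.c, self.m = a, c, m
        self.x = seed % m              # rule (i): the seed is arbitrary

    def next_raw(self):
        # Exact integer arithmetic, as rule (ii) demands.
        self.x = (self.a * self.x + self.c) % self.m
        return self.x

    def next_fraction(self):
        # Rule (vi): regard X as the fraction X/m, 0 <= X/m < 1.
        return self.next_raw() / self.m

    def next_below(self, k):
        # Rule (vi): multiply by k and truncate, so the result is
        # governed by the most significant digits of X.
        return (k * self.next_raw()) // self.m
```

For example, LinearCongruential(12345).next_below(6) + 1 could play the role of one roll of a die. Taking next_raw() % 6 instead would lean on the low-order digits, which rule (vi) warns against.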

The above comments apply primarily to machine-language coding. In higher-level
programming languages, we are often unable to use such machine-dependent
features as integer arithmetic modulo the word size, and careful compilers will not
allow us to compute the product of two large integers. Another technique that we
might call the subtractive method (exercise 3.2.2-23) can be used to provide a
“portable” random number generator that is efficiently describable in any higher-level
programming language, since it makes use only of integer arithmetic between
-10^9 and +10^9. Here is how the subtractive method might be coded in
FORTRAN, as a subroutine that delivers an array of 55 random integers at
once:

      FUNCTION IRN55(IA)
      DIMENSION IA(1)
C ASSUMING THAT IA(1), ..., IA(55) HAVE BEEN SET UP PROPERLY,
C THIS SUBROUTINE RESETS THE IA ARRAY TO THE NEXT 55 NUMBERS
C OF A PSEUDO-RANDOM SEQUENCE, AND RETURNS THE VALUE 1.
      DO 1 I = 1, 24
      J = IA(I) - IA(I+31)
      IF (J .LT. 0) J = J + 1000000000
      IA(I) = J
    1 CONTINUE
      DO 2 I = 25, 55
      J = IA(I) - IA(I-24)
      IF (J .LT. 0) J = J + 1000000000
      IA(I) = J
    2 CONTINUE
      IRN55 = 1
      RETURN
      END

To use these numbers in a FORTRAN program, let JRAND be an auxiliary integer
variable; we may obtain the next random number U (as a fraction between 0
and 1) by writing the following three statements:

      JRAND = JRAND + 1
      IF (JRAND .GT. 55) JRAND = IRN55(IA)
      U = FLOAT(IA(JRAND)) * 1.0E-9

At the beginning of our program we need to write “DIMENSION IA(55)” and 
“JRAND=55” and we must also initialize the IA array. Appropriate initialization 
can be done by calling the following subroutine with IX set to any integer value 
(selected like X_0 in rule (i) above, preferably large):

      SUBROUTINE IN55(IA,IX)
      DIMENSION IA(1)
C THIS SUBROUTINE SETS IA(1), ..., IA(55) TO STARTING
C VALUES SUITABLE FOR LATER CALLS ON IRN55(IA).
C IX IS AN INTEGER "SEED" VALUE BETWEEN 0 AND 999999999.
      IA(55) = IX
      J = IX
      K = 1
      DO 1 I = 1, 54
      II = MOD(21*I, 55)
      IA(II) = K
      K = J - K
      IF (K .LT. 0) K = K + 1000000000
      J = IA(II)
    1 CONTINUE
C THE NEXT THREE LINES "WARM UP" THE GENERATOR.
C (IRN55 IS A FUNCTION, SO ITS VALUE IS ASSIGNED AND IGNORED.)
      IDUMMY = IRN55(IA)
      IDUMMY = IRN55(IA)
      IDUMMY = IRN55(IA)
      RETURN
      END

(This subroutine computes a Fibonacci-like sequence; multiplication of indices 
by 21 spreads the values around so as to alleviate initial nonrandomness problems 
such as those in exercise 3.2.2-2. Note that 2^29 < 10^9 < 2^30; any large even
number may actually be substituted for 10^9 in these routines, if a corresponding
change is made in the computation of the random fraction U. Furthermore it 
would be possible to work directly with floating point numbers instead of integers 
by making appropriate changes to these programs, provided that the computer’s 
floating point arithmetic is sufficiently accurate to give exact results in all the 
computations required here. Most binary computers will be able to meet this 
requirement when all of the numbers to be dealt with have the form a/2^p, where
a is an integer, 0 ≤ a < 2^p, and there are p bits of precision in floating point
fractions. The numbers (24, 55) in these routines may be replaced by any pair
of values (j, k) in Table 3.2.2-1, for k > 50; the constants 31, 25, 54, 21 should
then be replaced by k - j, j + 1, k - 1, and d respectively, where d is relatively
prime to k and d ≈ 0.382k.)
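
For readers without a FORTRAN compiler at hand, the two routines above can be transcribed almost line for line. The Python sketch below is such a transcription and should be read as an illustration, not as a replacement for the FORTRAN text: the names in55 and irn55 mirror the routines above, while random_integers and random_fractions are hypothetical helpers that play the part of the three statements involving JRAND. Each list carries an unused element 0 so that the indices agree with the 1-based FORTRAN arrays, and the seed ix is assumed to lie between 0 and 999999999.

```python
MOD = 10**9                            # the constant 1000000000 above

def irn55(ia):
    """Replace ia[1..55] by the next 55 numbers of the sequence."""
    for i in range(1, 25):
        ia[i] = (ia[i] - ia[i + 31]) % MOD
    for i in range(25, 56):
        ia[i] = (ia[i] - ia[i - 24]) % MOD

def in55(ix):
    """Return a list ia with ia[1..55] initialized from the seed ix."""
    ia = [0] * 56                      # ia[0] is unused
    ia[55] = ix
    j, k = ix, 1
    for i in range(1, 55):
        ii = (21 * i) % 55             # never 0, since 21 is prime to 55
        ia[ii] = k
        k = (j - k) % MOD
        j = ia[ii]
    for _ in range(3):                 # "warm up" the generator
        irn55(ia)
    return ia

def random_integers(ix):
    """Yield integers in the range 0 <= X < 10^9 indefinitely."""
    ia, jrand = in55(ix), 55
    while True:
        jrand += 1
        if jrand > 55:
            irn55(ia)
            jrand = 1
        yield ia[jrand]

def random_fractions(ix):
    """Yield fractions U with 0 <= U < 1, like U in the FORTRAN text."""
    for x in random_integers(ix):
        yield x * 1.0e-9
```

Python's % operator already returns a nonnegative remainder for a positive modulus, so the test "IF (J .LT. 0)" needs no counterpart here. Successive outputs satisfy X_n = (X_{n-55} - X_{n-24}) mod 10^9, which is the recurrence the two DO loops implement in place.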

Although a great deal is known about linear congruential sequences like (1), 
very little has yet been proved about the randomness properties of the subtractive 
method. Both approaches seem to be reliable in practice. 

Unfortunately, quite a bit of published material in existence at the time 
this chapter was written recommends the use of generators that violate the 
suggestions above; and the most common generator in actual use, RANDU, is really 
horrible (cf. Section 3.3.4). The authors of many contributions to the science 
of random number generation were unaware that particular methods they were 
advocating would prove to be inadequate. Perhaps further research will show 
that even the random number generators recommended here are unsatisfactory; 
we hope this is not the case, but the history of the subject warns us to be 
cautious. The most prudent policy for a person to follow is to run each Monte 
Carlo program at least twice using quite different sources of random numbers, 
before taking the answers of the program seriously; this not only will give an 
indication of the stability of the results, it also will guard against the danger of 
trusting in a generator with hidden deficiencies. (Every random number generator 
will fail in at least one application.) 

Excellent bibliographies of the pre-1972 literature on random number generation
have been compiled by Richard E. Nance and Claude Overstreet, Jr.,
Computing Reviews 13 (1972), 495-508, and by E. R. Sowey, International 
Stat. Review 40 (1972), 355-371. The period 1972-1976 is covered by Sowey 
in International Stat. Review 46 (1978), 89-102. 

For a detailed study of the use of random numbers in numerical analysis, 
see J. M. Hammersley and D. C. Handscomb, Monte Carlo Methods (London: 
Methuen, 1964). This book shows that some numerical methods are enhanced 
by using numbers that are “quasi-random,” designed specifically for a certain 
purpose (not necessarily satisfying the statistical tests we have discussed). 

Every reader is urged to work exercise 6 in the following set of problems. 


EXERCISES 

1. [21] Write a MIX subroutine with the following characteristics, using method (1):
Calling sequence: JMP RANDI
Entry conditions: rA = k, a positive integer < 5000.
Exit conditions: rA ← a random integer Y, 1 ≤ Y ≤ k, with each integer
about equally probable; rX = ?; overflow off.

► 2. [15] Some people have been afraid that computers will someday take over the 
world; but they are reassured by the statement that a machine cannot do anything 
really new, since it is only obeying the commands of its master, the programmer. 
Lady Lovelace wrote in 1844, “The Analytical Engine has no pretensions to originate 
anything. It can do whatever we know how to order it to perform.” Her statement
has been further elaborated by many philosophers. Discuss this topic, with random 
number generators in mind. 

3. [32] (A dice game.) Write a program that simulates a roll of two dice, each of 
which takes on the values 1, 2, ..., 6 with equal probability. If the total is 7 or 11 on 
the first roll, the game is won; a total of 2, 3, or 12 loses; and on any other total, call 
that total the “point” and continue rolling dice until either a 7 occurs (a loss) or the 
point occurs again (a win). 

Play ten games. The result of each roll of the dice should be printed in the form 
mn, where m and n are the contents of the two dice, followed by some appropriate 
comment (like “snake eyes” or “little Joe” or “the hard way”, etc.). 

4. [40] (Solitaire or patience.) Some people spend a lot of valuable time playing card 
games of solitaire, and perhaps automation will make an important inroad in this area. 
Write a program that (a) shuffles a simulated deck of cards; (b) plays some common 
game of solitaire based on the order of the cards in the deck; (c) prints out the result 
of the game, i.e., how close the program came to winning. Several games should be 
played. The program might be set up to “cheat” upon request. 

5. [46] (Creative writing by computer.) A television program entitled “The Thinking 
Machine,” broadcast by the CBS television network on October 26, 1960, featured 
(among other things) two Western-style playlets that were written by a computer 
program. Here are the two scripts as they were printed out by the computer: 

Saga #1. (The gun is in the right hand; the money is in the left hand; the glass is 
on the table; the bottle is on the table; the holster is on the robber; the sheriff’s gun 
is in the sheriff’s right hand; the sheriff’s holster is on the sheriff.) 

ROBBER: (The robber is at the window.) Go to door; open door; go thru door;
close door; go to corner; put money down at corner; go to table; put gun
on table; sit and fidgit; sit at table; pick up glass with right hand (empty); 
put glass on table; pick up bottle with right hand; pour; put bottle on 
table; pick up glass with right hand; take a drink from glass; put glass 
on table; pick up bottle with right hand; sit at table; sit at table; go to 
corner; go to window; go to table; put bottle on table; sit and fidgit; sit 
at table; sit and fidgit; go to window; go to table; pick up glass with right 
hand. 

SHERIFF: (The sheriff is at the window.) See robber; (robber sees sheriff); go to 
door. 

ROBBER: Put glass on table; pick up gun with right hand; check gun. 

SHERIFF: Wait; open door; see robber; (robber sees sheriff); go thru door. 
ROBBER: Go to window; aim; fire; SHERIFF NICKED. 

SHERIFF: Go to window; aim; fire; MISSED; go to door; go to window. 

ROBBER: Go to door; aim; aim. 

SHERIFF: Aim; fire; MISSED. 

ROBBER: Fire; SHERIFF NICKED. 

SHERIFF: Go to door; aim; fire; MISSED; go thru door; aim. 

ROBBER: Aim; fire; MISSED; aim; fire; MISSED. 

SHERIFF: Fire; MISSED; go to window; aim; fire; MISSED. 

ROBBER: Aim; fire; MISSED; aim; fire; MISSED; aim; fire; SHERIFF NICKED. 
SHERIFF: Aim; fire; ROBBER HIT. 

ROBBER: Drop gun; robber dies. 
SHERIFF: Put gun in holster; go to table; pick up glass with right hand (empty); take
glass from right hand with left hand; pick up bottle with right hand; pour; 
put bottle on table; take glass from left hand with right hand; take a drink 
from glass; take glass from right hand with left hand; pick up bottle with 
right hand; pour; put bottle on table; take glass from left hand with right 
hand; take a drink from glass; put glass on table; go to corner; pick up 
money with right hand; go to door; go thru door; close door. CURTAIN. 

Saga #2. (The gun is in the right hand; the money is in the left hand; the glass is
on the table; the bottle is on the table; the holster is on the robber; the sheriff’s gun
is in the sheriff’s right hand; the sheriff’s holster is on the sheriff.)

ROBBER: (The robber is at the window.) Go to door; open door; go thru door;
close door; go to corner; put money down at corner; go to window; put
gun down at window; lean on window and look; lean on window and look; 
go to corner; count money; go to table; pick up glass with right hand 
(empty); take glass from right hand with left hand; pick up bottle with 
right hand; pour; put bottle on table; take glass from left hand with right 
hand; take a drink from glass; put glass on table; pick up bottle with right 
hand; pour; go to corner; put bottle down at corner; go to window; pick 
up gun with right hand; check gun; put gun in holster; go to table; pick 
up glass with right hand; take a drink from glass; go to window; put glass 
down at window. 

SHERIFF: (The sheriff is at the window.) See robber; (robber sees sheriff); go to 
door. 

ROBBER: Take gun from holster with right hand; check gun; go to door; check gun; 
put gun down at door. 

SHERIFF: Open door; see robber; (robber sees sheriff); go thru door; go to window. 

ROBBER: Pick up gun with right hand. 

SHERIFF: Go to table. 

ROBBER: Aim; fire; MISSED; aim; fire; SHERIFF HIT; blow out barrel; put gun in 
holster. 

SHERIFF: Drop gun; sheriff dies. 

ROBBER: Go to corner; pick up money with right hand; go to door; go thru door; 
close door. CURTAIN. 
A careful reading of the above scripts reveals the highly intense drama present
here. The computer program was careful to keep track of the locations of each player, 
the contents of his hands, etc. Actions taken by the players were random, governed 
by certain probabilities; the probability of a foolish action was increased depending on 
how much that player had had to drink and on how often he had been nicked in a shot. 
The reader will be able to deduce further properties of the program by studying the 
sample scripts. 

Of course, even the best scripts are rewritten before they are produced, and this 
is especially true when an inexperienced writer has prepared the original draft. Here 
are the scripts just as they were actually used in the show: 

Saga #1. Music up. 

MS Robber peering thru window of shack. 

CU Robber’s face. 

MS Robber entering shack. 

CU Robber sees whiskey bottle on table. 

CU Sheriff outside shack. 

MS Robber sees sheriff. 

LS Sheriff in doorway over shoulder of robber, both draw. 

MS Sheriff drawing gun. 

LS Shooting it out. Robber gets shot. 

MS Sheriff picking up money bags. 

MS Robber staggering. 

MS Robber dying. Falls across table, after trying to take last shot at sheriff. 

MS Sheriff walking thru doorway with money. 

MS of robber’s body, now still, lying across table top. Camera dollies back. (Laughter) 

Saga #2. Music up. 

CU of window. Robber appears. 

MS Robber entering shack with two sacks of money. 

MS Robber puts money bags on barrel. 

CU Robber—sees whiskey on table. 

MS Robber pouring himself a drink at table. Goes to count money. Laughs. 

MS Sheriff outside shack. 

MS thru window. 

MS Robber sees sheriff thru window. 

LS Sheriff entering shack. Draw. Shoot it out. 

CU Sheriff. Writhing from shot. 

M/2 shot Sheriff staggering to table for a drink . . . falls dead. 

MS Robber leaves shack with money bags.* 

[Note: CU = “close up”, MS = “medium shot”, etc. The above details were 
kindly furnished to the author by Thomas H. Wolf, producer of the television show, 
who suggested the idea of a computer-written playlet in the first place, and also by 
Douglas T. Ross and Harrison R. Morse who produced the computer program.] 

The reader will undoubtedly have many ideas about how he could prepare his own 
computer program to do creative writing; and that is the point of this exercise. 

► 6. [40] Look at the subroutine library of each computer installation in your organization,
and replace the random number generators by good ones. Try to avoid being too
shocked at what you find. 

*© 1962 by Columbia Broadcasting System, Inc. All Rights Reserved. Used by permission. 
For further information, see J. E. Pfeiffer, The Thinking Machine (New York: J. B. Lippincott, 
1962). 
► 7. [M40] A programmer decided to encipher his files by using a linear congruential
sequence ⟨X_n⟩ of period 2^32 generated by (a, c, X_0) with m = 2^32. He took the most
significant bits ⌊X_n/2^16⌋ and exclusive-or’ed them onto his data, but kept the parameters
a, c, and X_0 secret.

Show that this isn’t a very secure scheme, by devising a method that deduces the
multiplier a and the first difference X_1 - X_0 in a reasonable amount of time, given
only the values of ⌊X_n/2^16⌋ for 0 ≤ n < 150.



CHAPTER FOUR 


ARITHMETIC 


Seeing there is nothing (right well beloved Students in the Mathematickes)
that is so troublesome to Mathematicall practise, nor that doth more molest
and hinder Calculators, then the Multiplications, Divisions, square and
cubical Extractions of great numbers, which besides the tedious
expence of time are for the most part subject to many slippery errors.
I began therefore to consider in my minde, by what certaine and
ready Art I might remove those hindrances.

—JOHN NAPIER (1614) 

I do hate sums. There is no greater mistake than to call arithmetic an exact
science. There are . . . hidden laws of number which it requires a mind 
like mine to perceive. For instance, if you add a sum from the bottom up, 
and then again from the top down, the result is always different. 

—MRS. LA TOUCHE (19th century) 

I cannot conceive that anybody will require multiplications at the rate 
of 40,000, or even 4,000 per hour; such a revolutionary change as the 
octonary scale should not be imposed upon mankind in general 

for the sake of a few individuals. 

—F. H. WALES (1936) 

Most numerical analysts have no interest in arithmetic. 

—B. PARLETT (1979)


The chief purpose of this chapter is to make a careful study of the four 
basic processes of arithmetic: addition, subtraction, multiplication, and division. 
Many people regard arithmetic as a trivial thing that children learn and computers
do, but we will see that arithmetic is a fascinating topic with many interesting
facets. It is important to make a thorough study of efficient methods for
calculating with numbers, since arithmetic underlies so many computer applications.

Arithmetic is, in fact, a lively subject that has played an important part in 
the history of the world, and it still is undergoing rapid development. In this 
chapter, we shall analyze algorithms for doing arithmetic operations on many 
types of quantities, such as “floating point” numbers, extremely large numbers, 
fractions (rational numbers), polynomials, and power series; and we will also 
discuss related topics such as radix conversion, factoring of numbers, and the 
evaluation of polynomials. 
4.1. POSITIONAL NUMBER SYSTEMS

THE WAY WE DO ARITHMETIC is intimately related to the way we represent 
the numbers we deal with, so it is appropriate to begin our study of the subject 
with a discussion of the principal means for representing numbers. 

Positional notation using base b (or radix b) is defined by the rule

(... a_3 a_2 a_1 a_0 . a_{-1} a_{-2} ...)_b
    = ... + a_3 b^3 + a_2 b^2 + a_1 b^1 + a_0 + a_{-1} b^{-1} + a_{-2} b^{-2} + ...; (1)

for example, (520.3)_6 = 5·6^2 + 2·6^1 + 0 + 3·6^{-1} = 192½. Our conventional
decimal number system is, of course, the special case when b is ten, and when
the a’s are chosen from the “decimal digits” 0, 1, 2, 3, 4, 5, 6, 7, 8, 9; in this
case the subscript b in (1) may be omitted.
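
Rule (1) translates directly into a short computation. The Python sketch below is only an illustration; the function name radix_value and its argument conventions (integer-part digits listed most significant first, then the fraction digits in the order they follow the radix point) are assumptions made here, not notation from the text. Exact rational arithmetic is used so that no rounding intrudes.

```python
from fractions import Fraction

def radix_value(int_digits, frac_digits, b):
    """Exact value of (int_digits . frac_digits) in radix b, per rule (1)."""
    value = Fraction(0)
    for d in int_digits:               # most significant digit first
        value = value * b + d
    scale = Fraction(1, b)             # weight of the first fraction digit
    for d in frac_digits:              # digits to the right of the point
        value += d * scale
        scale /= b
    return value
```

With these conventions radix_value([5, 2, 0], [3], 6) reproduces the example above, giving Fraction(385, 2), that is, 192½.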

The simplest generalizations of the decimal number system are obtained 
when we take b to be an integer greater than 1 and when we require the a’s to 
be integers in the range 0 ≤ a_k < b. This gives us the standard binary (b = 2),
ternary (b = 3), quaternary (b = 4), quinary (b = 5), ... number systems. In
general, we could take b to be any nonzero number, and we could choose the a’s
from any specified set of numbers; this leads to some interesting situations, as 
we shall see. 

The dot that appears between a_0 and a_{-1} in (1) is called the radix point.
(When b = 10, it is also called the decimal point, and when b = 2, it is sometimes
called the binary point, etc.) Continental Europeans often use a comma instead 
of a dot to denote the radix point; Englishmen often use a raised dot. 

The a’s in (1) are called the digits of the representation. A digit a_k for large k
is often said to be “more significant” than the digits a_k for small k; accordingly,
the leftmost or “leading” digit is referred to as the most significant digit and the
rightmost or “trailing” digit is referred to as the least significant digit. In the
standard binary system the binary digits are often called bits; in the standard
hexadecimal system (radix sixteen) the hexadecimal digits zero through fifteen 
are usually denoted by 

0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F. 
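
The digits of a nonnegative integer in any standard radix from 2 to 16 can be produced by repeated division, least significant digit first. The following Python sketch uses the sixteen digit symbols just listed; the names DIGITS and to_radix are hypothetical, and the restriction to nonnegative integers is an assumption made to keep the example short.

```python
DIGITS = "0123456789ABCDEF"

def to_radix(n, b):
    """Digit string of the nonnegative integer n in radix b, 2 <= b <= 16."""
    if not 2 <= b <= len(DIGITS):
        raise ValueError("radix must lie between 2 and 16")
    if n < 0:
        raise ValueError("n must be nonnegative")
    if n == 0:
        return "0"
    out = []
    while n > 0:
        n, r = divmod(n, b)            # r is the least significant digit
        out.append(DIGITS[r])
    return "".join(reversed(out))      # most significant digit first
```

For instance, to_radix(192, 6) yields "520", in agreement with the integer part of the radix-6 example given with rule (1).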

The historical development of number representations is a fascinating story, 
since it parallels the development of civilization itself. We would be going far 
afield if we were to examine this history in minute detail, but it will be instructive 
to look at its main features here. 

The earliest forms of number representations, still found in primitive cultures,
are generally based on groups of fingers, piles of stones, etc., usually with
special conventions about replacing a larger pile or group of, say, five or ten
objects by one object of a special kind or in a special place. Such systems lead
naturally to the earliest ways of representing numbers in written form, as in the
systems of Babylonian, Egyptian, Greek, Chinese, and Roman numerals; but
such notations are comparatively inconvenient for performing arithmetic operations
except in the simplest cases.
During the twentieth century, historians of mathematics have made extensive
studies of early cuneiform tablets found by archeologists in the Middle East.
These studies show that the Babylonian people actually had two distinct systems 
of number representation: Numbers used in everyday business transactions were 
written in a notation based on grouping by tens, hundreds, etc.; this notation was 
inherited from earlier Mesopotamian civilizations, and large numbers were seldom 
required. When more difficult mathematical problems were considered, however, 
Babylonian mathematicians made extensive use of a sexagesimal (radix sixty) 
positional notation that was highly developed at least as early as 1750 B.C. This 
notation was unique in that it was actually a floating point form of representation
with exponents omitted; the proper scale factor or power of sixty was to be supplied
by the context, so that, for example, the numbers 2, 120, 7200, and 1/30 were
all written in an identical manner. The notation was especially convenient for
multiplication and division, using auxiliary tables, since radix-point alignment
had no effect on the answer. As examples of this Babylonian notation, consider
the following excerpts from early tables: The square of 30 is 15 (which may also
be read, “The square of 1/2 is 1/4”); the reciprocal of 81 = (1 21)_60 is (44 26 40)_60;
and the square of the latter is (32 55 18 31 6 40)_60. The Babylonians had a symbol
for zero, but because of their “floating point” philosophy, it was used only
within numbers, not at the right end to denote a scale factor. For the interesting 
story of early Babylonian mathematics, see O. Neugebauer, The Exact Sciences 
in Antiquity (Princeton, N. J.: Princeton University Press, 1952), and B. L. van 
der Waerden, Science Awakening, tr. by A. Dresden (Groningen: P. Noordhoff, 
1954); see also D. E. Knuth, CACM 15 (1972), 671-677; 19 (1976), 108. 
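
The table excerpts quoted above can be checked mechanically. The Python sketch below is an illustration only; the function names sexagesimal_digits and babylonian_form are hypothetical, and dropping trailing zero digits is merely a convenient way to model the omitted scale factor, not a claim about how the scribes worked.

```python
def sexagesimal_digits(n):
    """Radix-60 digits of a positive integer, most significant first."""
    digits = []
    while n > 0:
        n, r = divmod(n, 60)
        digits.append(r)
    return digits[::-1]

def babylonian_form(n):
    """Digits of n with trailing zeros removed: the power of sixty
    is left to be supplied by the context."""
    digits = sexagesimal_digits(n)
    while digits and digits[-1] == 0:
        digits.pop()
    return digits

# 2, 120 = 2*60, and 7200 = 2*60^2 are all written alike.
same = [babylonian_form(v) for v in (2, 120, 7200)]

# "The square of 30 is 15": 30*30 = 900 = 15*60.
square_of_30 = babylonian_form(30 * 30)

# The reciprocal of 81, scaled by 60^4 so that it becomes an integer.
recip_81 = babylonian_form(60**4 // 81)

# The square of that reciprocal.
recip_81_squared = babylonian_form((60**4 // 81) ** 2)
```

Since 81 × 160000 = 60^4 exactly, the quotient 60^4 // 81 loses nothing, and its digits are those of (44 26 40)_60. The value 1/30 = 2/60 is not an integer and so lies outside this simple sketch, although it too would be written as the single digit 2.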

Fixed point positional notation was apparently first conceived by the Maya 
Indians in central America 2000 years ago; their radix-20 system was highly 
developed, especially in connection with astronomical records and calendar dates. 
But the Spanish conquerors destroyed nearly all of the Maya books on history 
and science, so we have comparatively little knowledge about how sophisticated 
the native Americans had become at arithmetic; special-purpose multiplication 
tables have been found, but no examples of division are known [cf. J. Eric S.
Thompson, Contributions to Amer. Anthropology and History 7 (Carnegie Inst.
of Washington, 1942), 37-62].

Several centuries before Christ, the Greek people employed an early form 
of the abacus to do their arithmetical calculations, using sand and/or pebbles 
on a board that had rows or columns corresponding in a natural way to our 
decimal system. It is perhaps surprising to us that the same positional notation 
was never adapted to written forms of numbers, since we are so accustomed to 
reckoning with the decimal system using pencil and paper; but the greater ease 
of calculating by abacus (since handwriting was not a common skill, and since 
abacus calculation makes it unnecessary to memorize addition and multiplication 
tables) probably made the Greeks feel it would be silly even to suggest that 
computing could be done better on “scratch paper.” At the same time Greek 
astronomers did make use of a sexagesimal positional notation for fractions, 
which they had learned from the Babylonians. 
Our decimal notation, which differs from the more ancient forms primarily
because of its fixed radix point, together with its symbol for zero to mark an 
empty position, was developed first in India within the Hindu culture. The exact 
date when this notation first appeared is quite uncertain; about 600 A.D. seems 
to be a good guess. Hindu science was rather highly developed at that time, 
particularly in astronomy. The earliest known Hindu manuscripts that show 
this notation have numbers written backwards (with the most significant digit 
at the right), but soon it became standard to put the most significant digit at 
the left. 

About 750 A.D., the Hindu principles of decimal arithmetic were brought 
to Persia, as several important works were translated into Arabic; a picturesque 
account of this development is given in a Hebrew document, which has been 
translated into English in AMM 15 (1918), 99-108. Not long after this, al- 
Khwarizmi wrote his Arabic textbook on the subject. (As noted in Chapter 1, 
our word “algorithm” comes from al-Khwarizmi’s name.) His work was translated
into Latin and was a strong influence on Leonardo Pisano (Fibonacci),
whose book on arithmetic (1202 A.D.) played a major role in the spreading of 
Hindu-Arabic numerals into Europe. It is interesting to note that the left-to-right 
order of writing numbers was unchanged during these two transitions, although 
Arabic is written from right to left while Hindu and Latin scholars generally 
wrote from left to right. A detailed account of the subsequent propagation of 
decimal numeration and arithmetic into all parts of Europe during the period 
from 1200 to 1600 A.D. has been given by David Eugene Smith in his History of 
Mathematics 1 (Boston: Ginn and Co., 1923), Chapters 6 and 8. 

Decimal notation was applied at first only to integer numbers, not to fractions.
Arabic astronomers, who required fractions in their star charts and other
tables, continued to use the notation of Ptolemy (the famous Greek astronomer), 
a notation based on sexagesimal fractions. This system still survives today, in 
our trigonometric units of “degrees, minutes, and seconds,” and also in our units 
of time, as a remnant of the original Babylonian sexagesimal notation. Early 
European mathematicians also used sexagesimal fractions when dealing with 
noninteger numbers; for example, Fibonacci gave the value 

1° 22′ 7″ 42‴ 33^IV 4^V 40^VI

as an approximation to the root of the equation x^3 + 2x^2 + 10x = 20. (The
correct answer is 1° 22′ 7″ 42‴ 33^IV 4^V 38^VI 30^VII 50^VIII 15^IX 43^X ... .)
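
The sexagesimal expansion of this root can be recomputed with exact integer arithmetic. The Python sketch below is one way to do it, offered as an illustration and not as the method Fibonacci used: the function name sexagesimal_root is hypothetical, and bisection on scaled integers is simply a convenient choice. It finds the largest integer N with f(N/60^p) ≤ 0, where f(x) = x^3 + 2x^2 + 10x - 20, and then reads off the radix-60 digits of N.

```python
def sexagesimal_root(places):
    """Integer part followed by `places` truncated sexagesimal digits
    of the real root of x^3 + 2x^2 + 10x = 20."""
    s = 60 ** places

    def f_scaled(n):
        # s^3 * f(n/s) for f(x) = x^3 + 2x^2 + 10x - 20, in exact integers
        return n**3 + 2 * n**2 * s + 10 * n * s**2 - 20 * s**3

    lo, hi = s, 2 * s                  # f(1) = -7 < 0 < 16 = f(2)
    while hi - lo > 1:                 # keep f_scaled(lo) <= 0 < f_scaled(hi)
        mid = (lo + hi) // 2
        if f_scaled(mid) <= 0:
            lo = mid
        else:
            hi = mid

    digits = []
    for _ in range(places):            # peel off digits, least significant first
        lo, r = divmod(lo, 60)
        digits.append(r)
    return [lo] + digits[::-1]
```

Calling sexagesimal_root(6) gives the digits 1, 22, 7, 42, 33, 4, 38, so Fibonacci's value is off only in the sixth sexagesimal place, where he wrote 40.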

The use of decimal notation also for tenths, hundredths, etc., in a similar 
way seems to be a comparatively minor change; but, of course, it is hard to 
break with tradition, and sexagesimal fractions have an advantage over decimal 
fractions in that numbers such as 1/3 can be expressed exactly, in a simple way.

Chinese mathematicians—who never used sexagesimals—were apparently 
the first people to work with the equivalent of decimal fractions, although their 
numeral system (lacking zero) was not originally a positional number system in 
the strict sense. Chinese units of weights and measures were decimal, so that 
Tsu Chhung-Chih (who died c. 500 A.D.) was able to express an approximation 
to π in the following form:

3 chang, 1 chhih, 4 tshun, 1 fen, 5 li, 9 hao, 2 miao, 7 hu. 

Here chang, ..., hu are units of length; 1 hu (the diameter of a silk thread) equals 
1/10 miao, etc. The use of such decimal-like fractions was fairly widespread in 
China after about 1250 A.D. 

The first known appearance of decimal fractions in a true positional system 
occurs in a 10th-century arithmetic text written in Damascus by an obscure 
mathematician named al-Uqlidisi (“the Euclidean”). He used the symbol ' for 
a decimal point, for example in connection with a problem about compound 
interest, the computation of 135 times (1.1)^n for 1 ≤ n ≤ 5. [See A. S. Saidan,
The Arithmetic of al-Uqlidisi (Dordrecht: D. Reidel, 1975), 110, 114, 343, 355, 
481-485.] But he did not develop the idea very fully, and his trick was soon 
forgotten; five centuries passed before decimal fractions were reinvented by a 
Persian mathematician, al-Kashi, who died c. 1436. Al-Kashi was a highly skillful 
calculator, who gave the value of 2π as follows, correct to 16 decimal places:

    integer    fractions
    0 6        2 8 3 1 8 5 3 0 7 1 7 9 5 8 6 5

This was by far the best approximation to π known until Ludolph van Ceulen
laboriously calculated 35 decimal places during the period 1596-1610. 

The earliest known example of decimal fractions in Europe occurs in a 15th-century
text where, for example, 153.5 is multiplied by 16.25 to get 2494.375; this
was referred to as a “Turkish method.” In 1525, Christof Rudolff of Germany
discovered decimal fractions for himself; but like al-Uqlidisi, his work seems
to have had little influence. François Viète suggested the idea again in 1579.
Finally, an arithmetic text by Simon Stevin of Belgium, who independently hit 
on the idea of decimal fractions in 1585, became popular. Stevin’s work, and the 
discovery of logarithms soon afterwards, made decimal fractions commonplace 
in Europe during the 17th century. [See D. E. Smith, History of Mathematics 2 
(Boston: Ginn and Co., 1925), 228-247, and C. B. Boyer, History of Mathematics 
(New York: Wiley, 1968), for further remarks and references.] 

The binary system of notation has its own interesting history. Many primitive
tribes in existence today are known to use a binary or “pair” system of
counting (making groups of two instead of five or ten), but they do not count in 
a true radix-2 system, since they do not treat powers of 2 in a special manner. 
See The Diffusion of Counting Practices by Abraham Seidenberg, Univ. Calif. 
Publ. in Math. 3 (1960), 215-300, for interesting details about primitive number 
systems. Another “primitive” example of an essentially binary system is the 
conventional musical notation for expressing rhythms and durations of time. 

Nondecimal number systems were discussed in Europe during the seventeenth
century. For many years astronomers had occasionally used sexagesimal
arithmetic both for the integer and the fractional parts of numbers, primarily 
when performing multiplication [see John Wallis, Treatise of Algebra (Oxford, 
1685), 18-22, 30]. The fact that any integer greater than 1 could serve as radix
was apparently first stated in print by Blaise Pascal in De numeris multiplicibus, 
which was written about 1658 [see Pascal’s Œuvres Complètes (Paris: Éditions
du Seuil, 1963), 84-89]. Pascal wrote, “Denaria enim ex instituto hominum,
non ex necessitate naturae ut vulgus arbitratur, et sane satis inepte, posita est”; 
i.e., “The decimal system has been established, somewhat foolishly to be sure, 
according to man’s custom, not from a natural necessity as most people would 
think.” He stated that the duodecimal (radix twelve) system would be a welcome 
change, and he gave a rule for testing a duodecimal number for divisibility by 
nine. Erhard Weigel tried to drum up enthusiasm for the quaternary (radix four) 
system in a series of publications beginning in 1673. A detailed discussion of 
radix-twelve arithmetic was given by Joshua Jordaine, Duodecimal Arithmetick 
(London, 1687). 

Although decimal notation was almost exclusively used for arithmetic during 
that era, other systems of weights and measures were rarely if ever based on 
multiples of 10, and many business transactions required a good deal of skill in 
adding quantities such as pounds, shillings, and pence. For centuries merchants 
had therefore learned to compute sums and differences of quantities expressed 
in peculiar units of currency, weights, and measures; and this was actually 
arithmetic in a nondecimal number system. The common units of liquid measure 
in England, dating from the 13th century or earlier, are particularly noteworthy: 


    2 gills   = 1 chopin            2 demibushels = 1 bushel or firkin
    2 chopins = 1 pint              2 firkins     = 1 kilderkin
    2 pints   = 1 quart             2 kilderkins  = 1 barrel
    2 quarts  = 1 pottle            2 barrels     = 1 hogshead
    2 pottles = 1 gallon            2 hogsheads   = 1 pipe
    2 gallons = 1 peck              2 pipes       = 1 tun
    2 pecks   = 1 demibushel


Quantities of liquid expressed in gallons, pottles, quarts, pints, etc. were essentially
written in binary notation. Perhaps the true inventors of binary arithmetic
were English wine merchants! 
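
The remark about binary notation can be made concrete with a small Python sketch. It assumes the doubling table above, uses the hypothetical name `in_liquid_units`, and writes "firkin" for "bushel or firkin"; it is an illustration only, not a historical procedure.

```python
# Units from the table, smallest first; each is twice the one before it.
# "firkin" stands for "bushel or firkin" (an abbreviation made here).
UNITS = ["gill", "chopin", "pint", "quart", "pottle", "gallon", "peck",
         "demibushel", "firkin", "kilderkin", "barrel", "hogshead",
         "pipe", "tun"]


def in_liquid_units(gills: int) -> dict[str, int]:
    """Split a nonnegative number of gills into the doubling units."""
    if gills < 0:
        raise ValueError("quantity must be nonnegative")
    parts = {}
    for unit in UNITS[:-1]:
        if gills & 1:          # this binary digit of the quantity is 1
            parts[unit] = 1
        gills >>= 1            # move on to the next larger unit
    if gills:                  # whatever is left is a count of tuns
        parts["tun"] = gills
    return parts
```

Every unit below the tun occurs at most once, so the list of units present is simply the list of 1 bits in the binary representation of the quantity in gills.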

The first known appearance of pure binary notation was about 1605 in some 
unpublished manuscripts of Thomas Harriot (1560-1621). Harriot was a creative 
man, who first became famous by coming to America as a representative of Sir 
Walter Raleigh. He invented (among other things) a notation like that now used 
for “less than” and “greater than” relations; but for some reason he chose not 
to publish many of his discoveries. Excerpts from his notes on binary arithmetic 
have been reproduced by John W. Shirley, Amer. J. Physics 19 (1951), 452-454. 
The first published discussion of the binary system was given in a comparatively 
little-known work by a Spanish bishop, Juan Caramuel Lobkowitz, Mathesis 
biceps 1 (Campaniae, 1670), 45-48; Caramuel discussed the representation of 
numbers in radices 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, and 60 at some length, but 
gave no examples of arithmetic operations in nondecimal systems (except for 
the trivial operation of adding unity). 

Ultimately, an article by G. W. Leibniz [Mémoires de l’Académie Royale des
Sciences (Paris: 1703), 110-116], which illustrated binary addition, subtraction, 
multiplication, and division, really brought binary notation into the limelight, 
and this article is usually referred to as the birth of radix-2 arithmetic. Leibniz 
later referred to the binary system quite frequently. He did not recommend it for 
practical calculations, but he stressed its importance in number-theoretical investigations,
since patterns in number sequences are often more apparent in binary
notation than they are in decimal; he also saw a mystical significance in the fact 
that everything is expressible in terms of zero and one. Leibniz’s unpublished 
manuscripts show that he had been interested in binary notation as early as 
1679, when he referred to it as a “bimal” system (analogous to “decimal”). 

A careful study of Leibniz’s early work with binary numbers has been made 
by Hans J. Zacher, Die Hauptschriften zur Dyadik von G. W. Leibniz (Frankfurt 
am Main: Klostermann, 1973). Zacher points out that Leibniz was familiar with 
John Napier’s so-called “local arithmetic,” a way for calculating with stones 
that amounts to using a radix-2 abacus. [Napier had published the idea of local 
arithmetic as an appendix to his little book Rhabdologia in 1617; it may be 
called the world’s first “binary computer,” and it is surely the world’s cheapest, 
although Napier felt that it was more amusing than practical. See Martin 
Gardner’s discussion in Scientific American 228 (April 1973), 106-111.] 

It is interesting to note that the important concept of negative powers to the 
right of the radix point was not yet well understood at that time. Leibniz asked 
James Bernoulli to calculate $\pi$ in the binary system, and Bernoulli “solved” the
problem by taking a 35-digit approximation to $\pi$, multiplying it by $10^{35}$, and
then expressing this integer in the binary system as his answer. On a smaller
scale this would be like saying that $\pi \approx 3.14$, and $(314)_{10} = (100111010)_2$; hence
$\pi$ in binary is 100111010! [See Leibniz, Math. Schriften, ed. by K. Gehrhardt,
3 (Halle: 1855), 97; two of the 118 bits in the answer are incorrect, due to
computational errors.] The motive for Bernoulli’s calculation was apparently to
see whether any simple pattern could be observed in this representation of $\pi$.

Charles XII of Sweden, whose talent for mathematics perhaps exceeded that
of all other kings in the history of the world, hit on the idea of radix-8 arithmetic 
about 1717. This was probably his own invention, although he had met Leibniz 
briefly in 1707. Charles felt that radix 8 or 64 would be more convenient 
for calculation than the decimal system, and he considered introducing octal 
arithmetic into Sweden; but he died in battle before decreeing such a change. 
[See The Works of Voltaire 21 (Paris: E. R. DuMont, 1901), 49; E. Swedenborg, 
Gentleman’s Magazine 24 (1754), 423-424.] 

Octal notation was proposed also in colonial America before 1750, by the 
Rev. Hugh Jones, rector of a parish in Maryland [cf. Gentleman’s Magazine 15 
(1745), 377-379; H. R. Phalen, AMM 56 (1949), 461-465]. 

More than a century later, a prominent Swedish-American civil engineer 
named John W. Nystrom decided to carry Charles XII’s plans a step further,
by devising a complete system of numeration, weights, and measures based on 
radix-16 arithmetic. He wrote, “I am not afraid, or do not hesitate, to advocate a 
binary system of arithmetic and metrology. I know I have nature on my side; if I
do not succeed to impress upon you its utility and great importance to mankind, 
it will reflect that much less credit upon our generation, upon our scientific men 
and philosophers.” Nystrom devised special means for pronouncing hexadecimal 
numbers; e.g., $(\text{B0160})_{16}$ was to be read “vybong, bysanton.” His entire system
was called the Tonal System, and it is described in J. Franklin Inst. 46 (1863), 
263-275, 337-348, 402-407. A similar system, but using radix 8, was worked 
out by Alfred B. Taylor [Proc. Amer. Pharmaceutical Assoc. 8 (1859), 115-216;
Proc. Amer. Philosophical Soc. 24 (1887), 296-366]. Increased use of the French 
(metric) system of weights and measures prompted extensive debate about the 
merits of decimal arithmetic during that era; indeed, octal arithmetic was even 
being proposed in France [J. D. Colenne, Le système octaval (Paris: 1845); Aimé
Mariage, Numération par huit (Paris: Le Nonnant, 1857)].

The binary system was well known as a curiosity ever since Leibniz’s time, 
and about 20 early references to it have been compiled by R. C. Archibald 
[AMM 25 (1918), 139-142]. It was applied chiefly to the calculation of powers, 
as explained in Section 4.6.3, and to the analysis of certain games and puzzles. 
Giuseppe Peano [Atti della R. Accademia delle Scienze di Torino 34 (1898), 47- 
55] used binary notation as the basis of a “logical” character set of 256 symbols. 
Joseph Bowden [Special Topics in Theoretical Arithmetic (Garden City: 1936), 
49] gave his own system of nomenclature for hexadecimal numbers. 

The book History of Binary and Other Nondecimal Numeration by Anton 
Glaser (privately printed, 1971) contains an informative and nearly complete 
discussion of the development of binary notation, including English translations 
of many of the works cited above. 

Much of the recent history of number systems is connected with the development
of calculating machines. Charles Babbage’s notebooks for 1838 show
that he considered using nondecimal numbers in his Analytical Engine [cf. M. V. 
Wilkes, Historia Math. 4 (1977), 421]. Increased interest in mechanical devices 
for arithmetic, especially for multiplication, led several people in the 1930s to 
consider the binary system for this purpose. A particularly delightful account of 
such activity appears in the article “Binary Calculation” by E. William Phillips 
[Journal of the Institute of Actuaries 67 (1936), 187-221] together with a record 
of the discussion that followed a lecture he gave on the subject. Phillips began by 
saying, “The ultimate aim [of this paper] is to persuade the whole civilized world
to abandon decimal numeration and to use octonal [i.e., radix 8] numeration in
its place.” 

Modern readers of Phillips’s article will perhaps be surprised to discover that 
a radix-8 number system was properly referred to as “octonary” or “octonal,” 
according to all dictionaries of the English language at that time, just as the 
radix-10 number system is properly called either “denary” or “decimal”; the 
word “octal” did not appear in English language dictionaries until 1961, and 
it apparently originated as a term for the “base” of a certain class of vacuum 
tubes. The word “hexadecimal,” which has crept into our language even more 
recently, is a mixture of Greek and Latin stems; more proper terms would be 
“senidenary” or “sedecimal” or even “sexadecimal,” but the latter is perhaps too
risqué for computer programmers.

The comment by Mr. Wales that is quoted at the beginning of this chapter 
has been taken from the discussion printed with Phillips’s paper. Another man 
who attended the same lecture objected to the octal system for business purposes: 
“5% becomes 3.1463 per 64, which sounds rather horrible.”

Phillips got the inspiration for his proposals from an electronic circuit that 
was capable of counting in binary [C. E. Wynn-Williams, Proc. Roy. Soc. London 
A136 (1932), 312-324]. Electromechanical and electronic circuitry for general 
arithmetic operations was developed during the late 1930s, notably by John V. 
Atanasoff and George R. Stibitz in the U.S.A., L. Couffignal and R. Valtat in 
France, Helmut Schreyer and Konrad Zuse in Germany. All of these inventors 
used the binary system, although Stibitz later developed excess-3 binary-coded- 
decimal notation. A fascinating account of these early developments, including 
reprints and translations of important contemporary documents, appears in Brian 
Randell’s book The Origins of Digital Computers (Berlin: Springer, 1973). 

The first American high-speed computers, built in the early 1940s, used 
decimal arithmetic. But in 1946, an important memorandum by A. W. Burks, 
H. H. Goldstine, and J. von Neumann, in connection with the design of the 
first stored-program computers, gave detailed reasons for the decision to make 
a radical departure from tradition and to use base-two notation [see John von 
Neumann, Collected Works 5, 41-65]. Since then binary computers have multiplied.
After a dozen years of experience with binary machines, a discussion of
the relative advantages and disadvantages of radix-2 notation was given by W. 
Buchholz in his paper “Fingers or Fists?” [CACM 2 (December 1959), 3-11]. 

The MIX computer used in this book has been defined so that it can be 
either binary or decimal. It is interesting to note that nearly all MIX programs 
can be expressed without knowing whether binary or decimal notation is being 
used, even when we are doing calculations involving multiple-precision arithmetic.
Thus we find that the choice of radix does not significantly influence
computer programming. (Noteworthy exceptions to this statement, however, are 
the “Boolean” algorithms discussed in Section 7.1; see also Algorithm 4.5.2B.) 

There are several different methods for representing negative numbers in a 
computer, and this sometimes influences the way arithmetic is done. In order to 
understand these other notations, let us first consider MIX as if it were a decimal 
computer; then each word contains 10 digits and a sign, for example 

—12345 67890. (2) 

This is called the signed-magnitude representation. Such a representation agrees 
with common notational conventions, so it is preferred by many programmers. A 
potential disadvantage is that minus zero and plus zero can both be represented, 
while they usually should mean the same number; this possibility requires some 
care in practice, although it turns out to be useful at times. 

Most mechanical calculators that do decimal arithmetic use another system 
called ten’s complement notation. If we subtract 1 from 00000 00000, we get 
99999 99999 in this notation; in other words, no explicit sign is attached to the
number, and calculation is done modulo $10^{10}$. The number −12345 67890 would
appear as 

87654 32110 (3) 

in ten’s complement notation. It is conventional to regard any number whose 
leading digit is 5, 6, 7, 8, or 9 as a negative value in this notation, although 
with respect to addition and subtraction there is no harm in regarding (3) as the 
number +87654 32110 if it is convenient to do so. Note that there is no problem
of “minus zero” in such a system. 

The major difference between signed magnitude and ten’s complement notations
in practice is that shifting right does not divide the magnitude by ten; for
example, the number −11 = ... 99989, shifted right one, gives ... 99998 = −2
(assuming that a shift to the right inserts “9” as the leading digit when the number
shifted is negative). In general, x shifted right one digit in ten’s complement
notation will give $\lfloor x/10 \rfloor$, whether x is positive or negative.
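
The behavior of ten's complement words can be checked with a short Python sketch. It assumes the 10-digit decimal word of the example above; the names `encode`, `decode`, and `shift_right` are illustrative and do not correspond to any MIX instruction.

```python
WIDTH = 10                 # digits per word, as in the decimal MIX example
MOD = 10 ** WIDTH          # arithmetic is done modulo 10^10


def encode(x: int) -> str:
    """Return the ten's complement digit string for x."""
    if not -MOD // 2 <= x < MOD // 2:
        raise OverflowError("value does not fit in the word")
    return f"{x % MOD:0{WIDTH}d}"


def decode(word: str) -> int:
    """Interpret a digit string; a leading digit 5..9 means negative."""
    value = int(word)
    return value - MOD if word[0] >= "5" else value


def shift_right(word: str) -> str:
    """Shift right one digit, inserting 9 on the left if the word is negative."""
    fill = "9" if word[0] >= "5" else "0"
    return fill + word[:-1]
```

With these definitions `encode(-1234567890)` reproduces the word (3), and shifting the word for −11 yields the word for −2, the floor of −11/10.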

A possible disadvantage of the ten’s complement system is the fact that it is 
not symmetric about zero; the largest negative number representable in p digits 
is 500... 0, and it is not the negative of any p-digit positive number. Thus it is 
possible that changing x to —x will cause overflow. (See exercises 7 and 31 for 
a discussion of radix-complement notation with infinite precision.) 

Another notation that has been used since the earliest days of high-speed 
computers is called nines’ complement representation. In this case the number 
−12345 67890 would appear as


87654 32109. (4) 

Each digit of a negative number (−x) is equal to 9 minus the corresponding digit
of x. It is not difficult to see that the nines’ complement notation for a negative
number is always one less than the corresponding ten’s complement notation.
Addition and subtraction are done modulo $10^{10} - 1$, which means that a carry
off the left end is to be added at the right end. (Cf. Section 3.2.1.1.) Again 
there is a potential problem with minus zero, since 99999 99999 and 00000 00000 
denote the same value. 
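
A minimal Python sketch of nines' complement addition, again assuming 10-digit words; the helper names `encode` and `add` are hypothetical. It shows the carry off the left end being added back at the right end, and the appearance of minus zero.

```python
WIDTH = 10                 # digits per word


def encode(x: int) -> str:
    """Nines' complement digits of x, assuming |x| < 5 * 10^9."""
    magnitude = f"{abs(x):0{WIDTH}d}"
    if x >= 0:
        return magnitude
    # Each digit of a negative number is 9 minus the digit of its magnitude.
    return "".join(str(9 - int(d)) for d in magnitude)


def add(u: str, v: str) -> str:
    """Add two words modulo 10^10 - 1 using an end-around carry."""
    total = int(u) + int(v)
    if total >= 10 ** WIDTH:
        total = total - 10 ** WIDTH + 1   # carry off the left end re-enters at the right
    return f"{total:0{WIDTH}d}"
```

Adding the words for 5 and −5 gives 99999 99999 rather than 00000 00000, which is the minus-zero case mentioned in the text.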

The ideas just explained for radix 10 arithmetic apply in a similar way to 
radix 2 arithmetic, where we have signed magnitude, two’s complement, and 
ones’ complement notations. The MIX computer, as used in the examples of 
this chapter, deals only with signed-magnitude arithmetic; however, alternative 
procedures for complement notations are discussed in the accompanying text 
when it is important to do so. 

Most computer manuals tell us that the machine’s circuitry assumes that the 
radix point is situated in a particular place within each computer word. This 
advice should usually be disregarded. It is better to learn the rules concerning 
where the radix point will appear in the result of an instruction if we assume 
that it lies in a certain place beforehand. For example, in the case of MIX we 
could regard our operands either as integers with the radix point at the extreme 
right, or as fractions with the radix point at the extreme left, or as some mixture
of these two extremes; the rules for the appearance of the radix point in each 
result are straightforward. 

It is easy to see that there is a simple relation between radix $b$ and radix $b^k$:

$$(\ldots a_3 a_2 a_1 a_0 . a_{-1} a_{-2} \ldots)_b = (\ldots A_3 A_2 A_1 A_0 . A_{-1} A_{-2} \ldots)_{b^k}, \tag{5}$$

where

$$A_j = (a_{kj+k-1} \ldots a_{kj+1} a_{kj})_b;$$

see exercise 8. Thus we have simple techniques for converting at sight between,
say, binary and octal notation.
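
Equation (5) for integers amounts to grouping digits k at a time from the right. The following Python sketch illustrates this under the assumption that radix $b^k$ is at most 36, so that its digits can be written with 0-9 and A-Z; the function name `regroup` is hypothetical.

```python
def regroup(digits: str, k: int, base: int = 2) -> str:
    """Convert an integer numeral in radix `base` to radix base**k by grouping."""
    symbols = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    if base ** k > len(symbols):
        raise ValueError("radix base**k is too large for this sketch")
    pad = -len(digits) % k                 # leading zeros to fill the last group
    digits = "0" * pad + digits
    out = []
    for start in range(0, len(digits), k):
        group = digits[start:start + k]    # the digits a_{kj+k-1} ... a_{kj}
        out.append(symbols[int(group, base)])
    return "".join(out)
```

For example, grouping 100111010 by threes gives the octal numeral 472, and grouping by fours gives the hexadecimal numeral 13A; both denote 314.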

Many interesting variations on positional number systems are possible besides
the standard $b$-ary systems discussed so far. For example, we might have
numbers in base $-10$, so that

$$\begin{aligned}
(\ldots a_3 a_2 a_1 a_0 . a_{-1} a_{-2} \ldots)_{-10}
  &= \cdots + a_3(-10)^3 + a_2(-10)^2 + a_1(-10)^1 + a_0 + \cdots \\
  &= \cdots - 1000a_3 + 100a_2 - 10a_1 + a_0 - \tfrac{1}{10}a_{-1} + \tfrac{1}{100}a_{-2} - \cdots.
\end{aligned}$$

Here the individual digits satisfy $0 \le a_k \le 9$ just as in the decimal system. The
number 12345 67890 appears in the “negadecimal” system as

$$(1\ 93755\ 73910)_{-10}, \tag{6}$$

since the latter represents 10305070900 − 9070503010. It is interesting to note
that the negative of this number, −12345 67890, would be written

$$(28466\ 48290)_{-10}, \tag{7}$$

and, in fact, every real number whether positive or negative can be represented
without a sign in the −10 system.
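
One way to obtain such representations for integers is repeated division by the negative base with a nonnegative remainder. The Python sketch below is an illustration under that assumption, restricted to bases from −2 to −10 so that each digit is a single decimal character; the names `to_negabase` and `from_negabase` are hypothetical.

```python
def to_negabase(n: int, base: int = -10) -> str:
    """Digits of the integer n in a negative base between -2 and -10."""
    if not -10 <= base <= -2:
        raise ValueError("this sketch handles bases -2 through -10 only")
    if n == 0:
        return "0"
    digits = []
    while n != 0:
        r = n % -base              # digit in the range 0 .. |base| - 1
        digits.append(str(r))
        n = (n - r) // base        # exact division, since n - r is a multiple of base
    return "".join(reversed(digits))


def from_negabase(digits: str, base: int = -10) -> int:
    """Evaluate a negative-base numeral by Horner's rule."""
    value = 0
    for ch in digits:
        value = value * base + int(ch)
    return value
```

Applied to 12345 67890 and its negative, the sketch reproduces (6) and (7); no sign is needed in either case.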

Negative-base systems were first considered by Vittorio Grünwald [Giornale
di matematiche di Battaglini 23 (1885), 203-221, 367], who explained how to
perform the four arithmetic operations in such systems; Grünwald also discussed
root extraction, divisibility tests, and radix conversion. However, since his work 
was published in a rather obscure journal, it seems to have had no effect on 
other research, and it was soon forgotten. The next publication about negative- 
base systems was apparently by A. J. Kempner [AMM 43 (1936), 610-617], 
who discussed the properties of non-integer radices and remarked in a footnote 
that negative radices would be feasible too. After twenty more years the idea 
was rediscovered again, this time by Z. Pawlak and A. Wakulicz [Bulletin de
l’Académie Polonaise des Sciences, Classe III, 5 (1957), 233-236; Série des sciences
techniques 7 (1959), 713-721], and also by L. Wadel [IRE Transactions EC-6
(1957), 123]. For further references see IEEE Transactions EC-12 (1963), 274-
276; Computer Design 6 (May 1967), 52-63. There is evidence that the idea 
of negative bases occurred independently to quite a few people. For example, 
D. E. Knuth had discussed negative-base systems in 1955, together with a further 
generalization to complex-valued bases, in a short paper submitted to a “science 
talent search” contest for high-school seniors. 

The base $2i$ gives rise to a system called the “quater-imaginary” number
system (by analogy with “quaternary”), which has the unusual feature that every 
complex number can be represented with the digits 0, 1, 2, and 3 without a sign. 
[See D. E. Knuth, CACM 3 (1960), 245-247.] For example, 

$$(11210.31)_{2i} = 1 \cdot 16 + 1 \cdot (-8i) + 2 \cdot (-4) + 1 \cdot (2i) + 3 \cdot (-\tfrac12 i) + 1 \cdot (-\tfrac14) = 7\tfrac34 - 7\tfrac12 i.$$

Here the number $(a_{2n} \ldots a_1 a_0 . a_{-1} \ldots a_{-2k})_{2i}$ is equal to

$$(a_{2n} \ldots a_2 a_0 . a_{-2} \ldots a_{-2k})_{-4} + 2i\,(a_{2n-1} \ldots a_3 a_1 . a_{-1} \ldots a_{-2k+1})_{-4};$$

so conversion to and from quater-imaginary notation reduces to conversion to and 
from negative quaternary representation of the real and imaginary parts. The 
interesting property of this system is that it allows multiplication and division 
of complex numbers to be done in a fairly unified manner without treating real 
and imaginary parts separately. For example, we can multiply two numbers in 
this system much as we do with any base, merely using a different “carry” rule: 
whenever a digit exceeds 3 we subtract 4 and “carry” —1 two columns to the 
left; when a digit is negative, we add 4 to it and “carry” +1 two columns to the 
left. A study of the following example shows this peculiar carry rule at work: 

                1 2 2 3 1     [9 − 10i]
                1 2 2 3 1     [9 − 10i]
                ---------
                1 2 2 3 1
        1 0 3 2 0 2 1 3
            1 3 0 2 2
          1 3 0 2 2
        1 2 2 3 1
        -----------------
        0 2 1 3 3 3 1 2 1     [−19 − 180i]
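
The carry rule can be tried out with the following Python sketch for integer quater-imaginary numerals (no radix point). It is only an illustration of the rule stated above: it sums all digit products column by column first and then propagates carries, rather than normalizing each partial product as the hand computation does. The names `qi_value` and `qi_multiply` are hypothetical.

```python
def qi_value(digits: str) -> complex:
    """Evaluate an integer quater-imaginary numeral (digits 0..3, base 2i)."""
    value = 0j
    for ch in digits:
        value = value * 2j + int(ch)       # Horner's rule with base 2i
    return value


def qi_multiply(x: str, y: str) -> str:
    """Multiply two integer quater-imaginary numerals digit by digit."""
    cols = [0] * (len(x) + len(y) - 1)     # column sums, least significant first
    for i, a in enumerate(reversed(x)):
        for j, b in enumerate(reversed(y)):
            cols[i + j] += int(a) * int(b)
    k = 0
    while k < len(cols):
        digit = cols[k] % 4                # bring the column into the range 0..3
        carry = (cols[k] - digit) // 4     # how many 4s were removed (may be negative)
        cols[k] = digit
        if carry:
            while len(cols) <= k + 2:
                cols.append(0)
            cols[k + 2] -= carry           # removing 4 here carries -1 two columns left
        k += 1
    text = "".join(str(d) for d in reversed(cols)).lstrip("0")
    return text or "0"
```

Squaring 12231 with this sketch gives 21333121, the product shown in the example without its leading zero.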

A similar system that uses just the digits 0 and 1 may be based on $\sqrt{2}\,i$,
but this requires an infinite nonrepeating expansion for the simple number “i”
itself. Vittorio Grünwald proposed using the digits 0 and $1/\sqrt{2}$ in odd-numbered
positions, to avoid such a problem, but this actually spoils the whole system [cf.
Commentari dell’Ateneo di Brescia (1886), 43-54].

Another “binary” complex number system may be obtained by using the 
base $i - 1$, as suggested by W. Penney [JACM 12 (1965), 247-248]:

$$(\ldots a_4 a_3 a_2 a_1 a_0 . a_{-1} \ldots)_{i-1}
  = \cdots - 4a_4 + (2+2i)a_3 - 2i\,a_2 + (i-1)a_1 + a_0 - \tfrac12(i+1)a_{-1} + \cdots.$$

Fig. 1. The set S. (Illustration by P. M. Farmwald, R. W. Gosper, and R. E. Maas.)

In this system, only the digits 0 and 1 are needed. One way to demonstrate that 
every complex number has such a representation is to consider the interesting 
set S shown in Fig. 1; this set is, by definition, all points that can be written 
as $\sum_{k \ge 1} a_k (i-1)^{-k}$ for an infinite sequence $a_1, a_2, a_3, \ldots$ of zeros and ones.

Figure 1 shows that S can be decomposed into 256 pieces congruent to $\frac{1}{16}S$;
note that if the diagram of S is rotated counterclockwise by 135°, we obtain
two adjacent sets congruent to $(1/\sqrt{2})S$ (since $(i-1)S = S \cup (S+1)$). For
details of a proof that S contains all complex numbers that are of sufficiently
small magnitude, see exercise 18. 
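
For Gaussian integers the digits in base $i - 1$ can be found by repeated exact division, since $a + bi$ is divisible by $i - 1$ exactly when $a + b$ is even. The Python sketch below illustrates this; it is not taken from the text, it relies on the known fact that the process terminates for this base, and the function names are hypothetical.

```python
def to_base_i_minus_1(a: int, b: int) -> str:
    """Binary digits of the Gaussian integer a + b*i in base i - 1."""
    if a == 0 and b == 0:
        return "0"
    digits = []
    while a != 0 or b != 0:
        d = (a + b) % 2            # the digit that makes a + bi - d divisible by i - 1
        digits.append(str(d))
        a -= d
        # (a + bi) / (i - 1) = ((b - a) - (a + b) i) / 2, exact because a + b is even
        a, b = (b - a) // 2, -(a + b) // 2
    return "".join(reversed(digits))


def from_base_i_minus_1(digits: str) -> tuple[int, int]:
    """Evaluate a numeral in base i - 1; returns (real part, imaginary part)."""
    x, y = 0, 0
    for ch in digits:
        # (x + yi)(i - 1) = (-x - y) + (x - y) i, then add the next digit
        x, y = -x - y + int(ch), x - y
    return x, y
```

For instance, the sketch gives $2 = (1100)_{i-1}$ and $-1 = (11101)_{i-1}$, in agreement with the powers of $i - 1$ listed in the expansion above.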

Perhaps the prettiest number system of all is the balanced ternary notation, 
which consists of base-3 representation using —1, 0, and +1 as “trits” (ternary 
digits) instead of 0, 1, and 2. If we use the symbol 1̄ to stand for −1, we have
the following examples of balanced ternary numbers:

    Balanced ternary    Decimal
    101̄                 8
    111̄0.1̄1̄             32 5/9
    1̄1̄10.11             −32 5/9
    1̄1̄10                −33
    0.11111...          1/2


One way to find the representation of a number in the balanced ternary 
system is to start by representing it in ternary notation; for example, 

$$208.3 = (21201.022002200220\ldots)_3.$$

(A very simple pencil-and-paper method for converting to ternary notation is
given in exercise 4.4-12.) Now add the infinite number ...11111.11111... in
ternary notation; we obtain, in the above example, the infinite number

$$(\ldots 11111210012.210121012101\ldots)_3.$$

Finally, subtract ...11111.11111... by decrementing each digit; we get

$$208.3 = (10\bar{1}\bar{1}01.10\bar{1}010\bar{1}010\bar{1}0\ldots)_3. \tag{8}$$

This process may clearly be made rigorous if we replace the artificial infinite 
number ...11111.11111... by a number with suitably many ones. 
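
For integers the same result can be reached more directly by taking remainders in the range −1..1. The Python sketch below uses that alternative route (not the add-and-decrement procedure just described) and writes the letter T as an ASCII stand-in for the digit −1; the function names are hypothetical.

```python
def to_balanced_ternary(n: int) -> str:
    """Balanced ternary digits of the integer n, with 'T' standing for -1."""
    if n == 0:
        return "0"
    digits = []
    while n != 0:
        r = n % 3                  # ordinary remainder 0, 1, or 2
        if r == 2:
            r = -1                 # use -1 instead of 2, so the quotient absorbs a carry
        digits.append("T" if r == -1 else str(r))
        n = (n - r) // 3           # exact division
    return "".join(reversed(digits))


def from_balanced_ternary(digits: str) -> int:
    """Evaluate a balanced ternary integer numeral by Horner's rule."""
    value = 0
    for ch in digits:
        value = value * 3 + (-1 if ch == "T" else int(ch))
    return value
```

The integer part of (8) is recovered as `to_balanced_ternary(208)`, which returns `10TT01`.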

The balanced ternary number system has many pleasant properties: 

a) The negative of a number is obtained by interchanging 1 and 1̄.

b) The sign of a number is given by its most significant nonzero “trit,” and in 
general we can compare any two numbers by reading them from left to right 
and using lexicographic order, as in the decimal system. 

c) The operation of rounding to the nearest integer is identical to truncation 
(i.e., deleting everything to the right of the radix point). 

Addition in the balanced ternary system is quite simple, using the table 

TITTIIITTOOOOOOOOOl 1 111 111 1 
TTIOOOl llTTTOOOl 1 lTTTOOOlll 

Toll Ill T 0 T 0 111 I 0 T 0 1 0 111 T 0 1 0 111 llllO 

(The three inputs to the addition are the digits of the numbers to be added and 
the carry digits.) Subtraction is negation followed by addition; and multiplication 
also reduces to negation and addition, as in the following example: 

              1 1̄ 0 1̄     [17]
              1 1̄ 0 1̄     [17]
              -------
              1̄ 1 0 1
          1̄ 1 0 1
        1 1̄ 0 1̄
        -------------
        0 1 1 1̄ 1̄ 0 1     [289]
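
The addition rule and the reduction of multiplication to negation and addition can be sketched in Python as follows. The letter T is used as an ASCII stand-in for the digit −1, only integer numerals are handled, and the function names are hypothetical.

```python
VALUE = {"T": -1, "0": 0, "1": 1}
SYMBOL = {-1: "T", 0: "0", 1: "1"}


def bt_add(x: str, y: str) -> str:
    """Add two balanced ternary integers digit by digit with a carry."""
    width = max(len(x), len(y))
    x, y = x.rjust(width, "0"), y.rjust(width, "0")
    carry, out = 0, []
    for a, b in zip(reversed(x), reversed(y)):
        total = VALUE[a] + VALUE[b] + carry    # three inputs, sum between -3 and 3
        digit = (total + 1) % 3 - 1            # result digit in -1..1
        carry = (total - digit) // 3           # carry digit in -1..1
        out.append(SYMBOL[digit])
    out.append(SYMBOL[carry])
    return "".join(reversed(out)).lstrip("0") or "0"


def bt_negate(x: str) -> str:
    """Negate by interchanging the digits 1 and -1."""
    return x.translate(str.maketrans("1T", "T1"))


def bt_multiply(x: str, y: str) -> str:
    """Multiply using only shifts, negation, and addition."""
    product = "0"
    for shift, d in enumerate(reversed(y)):
        if d == "1":
            product = bt_add(product, x + "0" * shift)
        elif d == "T":
            product = bt_add(product, bt_negate(x) + "0" * shift)
    return product
```

Multiplying `1T0T` by itself with this sketch yields `11TT01`, which is the product 289 of the worked example without its leading zero.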

Representation of numbers in the balanced ternary system is implicitly
present in a famous mathematical puzzle, which is commonly called “Bachet’s 
problem of weights” although it was already stated by Fibonacci four centuries 
before Bachet wrote his book. [See W. Ahrens, Mathematische Unterhaltungen 
und Spiele 1 (Leipzig: Teubner, 1910), Section 3.4.] Positional number systems 
with negative digits have apparently been known for more than 1000 years in 
India; see J. Bharati, Vedic Mathematics (Delhi: Motilal Banarsidass, 1965). 
They were independently rediscovered by J. Colson [Philos. Trans. 34 (1726), 
161-173], and by Sir John Leslie [The Philosophy of Arithmetic (Edinburgh, 
1817); see pp. 33-34, 54, 64-65, 117, 150]; and also by A. Cauchy [Comptes
Rendus 11 (Paris: 1840), 789-798], who pointed out that negative digits make it
unnecessary for a person to memorize the multiplication table past 5 × 5. The
first true appearance of “pure” balanced ternary notation was in an article by
Léon Lalanne [Comptes Rendus 11 (Paris: 1840), 903-905], who was a designer
of mechanical devices for arithmetic. The system was mentioned only rarely 
for 100 years after Lalanne’s paper, until the development of the first electronic 
computers at the Moore School of Electrical Engineering in 1945-1946; at that 
time it was given serious consideration along with the binary system as a possible 
replacement for the decimal system. The complexity of arithmetic circuitry for 
balanced ternary arithmetic is not much greater than it is for the binary system, 
and a given number requires only $\ln 2/\ln 3 \approx 63\%$ as many digit positions for
its representation. Discussions of the balanced ternary number system appear 
in AMM 57 (1950), 90-93, and in High-speed Computing Devices, Engineering 
Research Associates (McGraw-Hill, 1950), 287-289. The experimental Russian 
computer SETUN was based on balanced ternary notation [see CACM 3 (1960), 
149-150], and perhaps the symmetric properties and simple arithmetic of this 
number system will prove to be quite important some day—when the “flip-flop” 
is replaced by a “flip-flap-flop”. 

Positional notation generalizes in another important way to a mixed-radix 
system. Given a sequence of numbers $\langle b_n \rangle$ (where n may be negative), we define

$$\begin{bmatrix} \ldots, & a_3, & a_2, & a_1, & a_0; & a_{-1}, & a_{-2}, & \ldots \\ \ldots, & b_3, & b_2, & b_1, & b_0; & b_{-1}, & b_{-2}, & \ldots \end{bmatrix}
  = \cdots + a_3 b_2 b_1 b_0 + a_2 b_1 b_0 + a_1 b_0 + a_0 + a_{-1}/b_{-1} + a_{-2}/b_{-1}b_{-2} + \cdots. \tag{9}$$

In the simplest mixed-radix systems, we work only with integers; we let $b_0$, $b_1$,
$b_2$, ... be integers greater than one, and deal only with numbers that have no
radix point, where $a_n$ is required to lie in the range $0 \le a_n < b_n$.

One of the most important mixed-radix systems is the factorial number 
system, where $b_n = n + 2$. Using this system, we can represent every positive
integer uniquely in the form

$$c_n\,n! + c_{n-1}\,(n-1)! + \cdots + c_2\,2! + c_1, \tag{10}$$

where $0 \le c_k \le k$ for $1 \le k \le n$, and $c_n \ne 0$. (See Algorithm 3.3.2P.)
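
The coefficients in (10) can be computed by dividing successively by 2, 3, 4, and so on. The following Python sketch illustrates this; the function names are hypothetical, and the digit list is returned most significant first.

```python
def to_factorial(n: int) -> list[int]:
    """Coefficients [c_n, ..., c_2, c_1] of a positive integer n as in (10)."""
    if n <= 0:
        raise ValueError("n must be positive")
    coefficients = []
    radix = 2                          # b_0 = 2, b_1 = 3, b_2 = 4, ...
    while n > 0:
        n, c = divmod(n, radix)        # c is the next coefficient, 0 <= c < radix
        coefficients.append(c)
        radix += 1
    return coefficients[::-1]


def from_factorial(coefficients: list[int]) -> int:
    """Evaluate c_n n! + ... + c_2 2! + c_1 from [c_n, ..., c_1]."""
    value, factorial = 0, 1
    for k, c in enumerate(reversed(coefficients), start=1):
        factorial *= k                 # factorial now equals k!
        value += c * factorial
    return value
```

For example, 463 = 3·5! + 4·4! + 1·3! + 0·2! + 1, so `to_factorial(463)` returns `[3, 4, 1, 0, 1]`.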

Mixed-radix systems are familiar in everyday life, when we deal with units
of measure. For example, the quantity “3 weeks, 2 days, 9 hours, 22 minutes, 57
seconds, and 492 milliseconds” is equal to

$$\begin{bmatrix} 3, & 2, & 9, & 22, & 57; & 492 \\ & 7, & 24, & 60, & 60; & 1000 \end{bmatrix} \text{ seconds.}$$

The quantity “10 pounds, 6 shillings, and thruppence ha’penny” was once equal
to $\begin{bmatrix} 10, & 6, & 3; & 1 \\ & 20, & 12; & 2 \end{bmatrix}$ pence in British currency, before Great Britain changed to a
purely decimal monetary system.
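
The value of a mixed-radix numeral as defined in (9) can be computed exactly with rational arithmetic. The Python sketch below is one way to do it; the function name `mixed_radix_value` and its argument layout are assumptions made for this illustration.

```python
from fractions import Fraction


def mixed_radix_value(int_digits, int_radices, frac_digits=(), frac_radices=()):
    """Evaluate a mixed-radix numeral.

    int_digits   = [a_n, ..., a_1, a_0]
    int_radices  = [b_{n-1}, ..., b_1, b_0]   (one fewer than int_digits)
    frac_digits  = [a_{-1}, a_{-2}, ...]
    frac_radices = [b_{-1}, b_{-2}, ...]
    """
    value = Fraction(int_digits[0])
    for a, b in zip(int_digits[1:], int_radices):
        value = value * b + a              # Horner's rule with a varying radix
    scale = 1
    for a, b in zip(frac_digits, frac_radices):
        scale *= b                         # scale = b_{-1} b_{-2} ... so far
        value += Fraction(a, scale)
    return value
```

With the two examples above, `mixed_radix_value([3, 2, 9, 22, 57], [7, 24, 60, 60], [492], [1000])` is 2020977.492 seconds, and `mixed_radix_value([10, 6, 3], [20, 12], [1], [2])` is 2475½ pence.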

It is possible to add and subtract mixed-radix numbers by using a straightforward
generalization of the usual addition and subtraction algorithms, provided
of course that the same mixed-radix system is being used for both operands 
(see exercise 4.3.1-9). Similarly, we can easily multiply or divide a mixed-radix 
number by small integer constants, using simple extensions of the familiar pencil- 
and-paper methods. 

Mixed-radix systems were first discussed in full generality by Georg Cantor 
[Zeitschrift für Math. und Physik 14 (1869), 121-128]. Exercises 26 and 29 give
further information about them. 

Some questions concerning irrational radices have been investigated by W.
Parry, Acta Mathematica, Acad. Sci. Hung., 11 (1960), 401-416.

Besides the systems described in this section, several other ways to represent 
numbers are mentioned elsewhere in this series of books: the binomial number 
system (exercise 1.2.6-56); the Fibonacci number system (exercises 1.2.8-34, 
5.4.2-10); the phi number system (exercise 1.2.8-35); modular representations 
(Section 4.3.2); Gray code (Section 7.2.1); and roman numerals (Section 9.1). 


EXERCISES 

1. [15] Express −10, −9, ..., 9, 10 in the number system whose base is −2.

► 2. [24] Consider the following four number systems: (a) binary (signed magnitude);
(b) negabinary (radix −2); (c) balanced ternary; and (d) radix $b = \frac{1}{10}$. Use each of
these four number systems to express each of the following three numbers: (i) −49;
(ii) $-3\frac{1}{7}$ (show the repeating cycle); (iii) $\pi$ (to a few significant figures).

3. [20] Express $-49 + i$ in the quater-imaginary system.

4. [15] Assume that we have a MIX program in which location A contains a number
for which the radix point lies between bytes 3 and 4, while location B contains a number
whose radix point lies between bytes 2 and 3. (The leftmost byte is number 1.) Where
will the radix point be, in registers A and X, after the following instructions?

    a)  LDA  A          b)  LDA  A
        MUL  B ∎            SRAX 5
                            DIV  B ∎

5. [00] Explain why a negative integer in nines’ complement notation has a representation
in ten’s complement notation that is always one greater, if the representations
are regarded as positive.

6. [16] What are the largest and smallest p-bit integers that can be represented in (a)
signed-magnitude binary notation (including one bit for the sign), (b) two’s complement 
notation, (c) ones’ complement notation? 

7. [M20] The text defines ten’s complement notation only for integers represented 
in a single computer word. Is there a way to define a ten’s complement notation for all 
real numbers, having “infinite precision,” analogous to the text’s definition? Is there a 
similar way to define a nines’ complement notation for all real numbers? 

8. [M10] Prove Eq. (5).

► 9. [15] Change the following octal numbers to hexadecimal notation, using the
hexadecimal digits 0, 1, ..., F: 12; 5655; 2550276; 76545336; 3726755.

10. [M22] Generalize Eq. (5) to mixed-radix notation. 

11. [22] Design an algorithm that uses the −2 number system to compute the sum
of $(a_n \ldots a_1 a_0)_{-2}$ and $(b_n \ldots b_1 b_0)_{-2}$, obtaining the answer $(c_{n+2} \ldots c_1 c_0)_{-2}$.

12. [23] Specify algorithms that convert (a) the binary signed magnitude number
$\pm(a_n \ldots a_0)_2$ to its negabinary form $(b_{n+2} \ldots b_0)_{-2}$; and (b) the negabinary number
$(b_{n+1} \ldots b_0)_{-2}$ to its signed magnitude form $\pm(a_{n+1} \ldots a_0)_2$.

► 13. [M21] In the decimal system there are some numbers with two infinite decimal 
expansions; e.g., 2.3599999... = 2.3600000.... Does the negadecimal (base $-10$)
system have unique expansions, or are there real numbers with two different infinite 
expansions in this base also? 

14. [14] Multiply $(11321)_{2i}$ by itself in the quater-imaginary system using the method
illustrated in the text.

15. [M24] What are the sets

    $S = \bigl\{ \sum_{k \ge 1} a_k b^{-k} \bigm| a_k \text{ an allowable digit} \bigr\},$

analogous to Fig. 1, for the negative decimal and for the quater-imaginary number
systems?

16. [M24] Design an algorithm to add 1 to $(a_n \ldots a_1 a_0)_{i-1}$ in the $i - 1$ number system.

17. [M30] It may seem peculiar that $i - 1$ has been suggested as a number-system
base, instead of the similar but intuitively simpler number $i + 1$. Can every complex
number $a + bi$, where a and b are integers, be represented in a positional number system
to base $i + 1$, using only the digits 0 and 1?

18. [HM32] Show that the set S of Fig. 1 is a closed set that contains a neighborhood
of the origin. (Consequently, every complex number has a “binary” representation to
base $i - 1$.)

► 19. [23] (David W. Matula.) Let D be a set of b integers, containing exactly one
solution to the congruence $x \equiv j$ (modulo b) for $0 \le j < b$. Prove that all integers
m (positive, negative, or zero) can be represented in the form $m = (a_n \ldots a_0)_b$, where
all the $a_j$ are in D, if and only if all integers in the range $l \le m \le u$ can be so
represented, where $l = -\max\{a \mid a \in D\}/(b - 1)$, $u = -\min\{a \mid a \in D\}/(b - 1)$.
For example, $D = \{-1, 0, \ldots, b - 2\}$ satisfies the conditions for all $b \ge 3$. [Hint:
Design an algorithm that constructs a suitable representation.]


20. [HM28] (David W. Matula.) Consider a decimal number system that uses the
digits $D = \{-1, 0, 8, 17, 26, 35, 44, 53, 62, 71\}$ instead of {0, 1, ..., 9}. The result of
exercise 19 implies (as in exercise 18) that all real numbers have an infinite decimal
expansion using digits from D.

In the usual decimal system, exercise 13 points out that some numbers have two
representations. (a) Find a real number that has more than two D-decimal representations.
(b) Show that no real number has infinitely many D-decimal representations.
(c) Show that uncountably many numbers have two or more D-decimal representations.

► 21. [M22] (C. E. Shannon.) Can every real number (positive, negative, or zero) be
expressed in a “balanced decimal” system, i.e., in the form $\sum_{k \le n} a_k 10^k$, for some
integer n and some sequence $a_n, a_{n-1}, a_{n-2}, \ldots$, where each $a_k$ is one of the ten
numbers $\{-4\tfrac12, -3\tfrac12, -2\tfrac12, -1\tfrac12, -\tfrac12, \tfrac12, 1\tfrac12, 2\tfrac12, 3\tfrac12, 4\tfrac12\}$? (Note that zero is not one of
the allowed digits, but we implicitly assume that $a_{n+1}, a_{n+2}, \ldots$ are zero.) Find all
representations of zero in this number system, and find all representations of unity.

22. [HM25] Let $\alpha = -\sum_{m \ge 1} 10^{-m^2}$. Given $\epsilon > 0$ and any real number x, prove
that there is a “decimal” representation such that $0 < \bigl| x - \sum_{k=0}^{n} a_k 10^k \bigr| < \epsilon$,
where each $a_k$ is allowed to be only one of the three values 0, 1, or $\alpha$. (Note that no
negative powers of 10 are used in this representation!)

23. [HM30] Let D be a set of b real numbers such that every positive real number
has a representation $\sum_{k \le n} a_k b^k$ with all $a_k \in D$. Exercise 20 shows that there may
be many numbers without unique representations; but prove that the set T of all such
numbers has measure zero.

24. [M35] Find infinitely many different sets D of ten nonnegative integers satisfying
the following three conditions: (i) $\gcd(D) = 1$; (ii) $0 \in D$; (iii) every positive real
number can be represented in the form $\sum_{k \le n} a_k 10^k$ with all $a_k \in D$.

25. [M25] (S. A. Cook.) Let b, u, and v be positive integers, where $b \ge 2$ and
$0 < v < b^m$. Show that the base b representation of $u/v$ does not contain a run of
m consecutive digits equal to $b - 1$, anywhere to the right of the radix point. (By
convention, no runs of infinitely many $(b - 1)$’s are permitted in the standard base b
representation.)

► 26. [HM30] (N. S. Mendelsohn.) Let $\langle \beta_n \rangle$ be a sequence of real numbers defined for
all integers n, $-\infty < n < \infty$, such that

    $\beta_n < \beta_{n+1}; \qquad \lim_{n \to \infty} \beta_n = \infty; \qquad \lim_{n \to -\infty} \beta_n = 0.$

Let $\langle c_n \rangle$ be an arbitrary sequence of positive integers that is defined for all integers n,
$-\infty < n < \infty$. Let us say that a number x has a “generalized representation” if
there is an integer n and an infinite sequence of integers $a_n, a_{n-1}, a_{n-2}, \ldots$ such that
$x = \sum_{k \le n} a_k \beta_k$, where $a_n \ne 0$, $0 \le a_k \le c_k$, and $a_k < c_k$ for infinitely many k.

Show that every positive real number x has exactly one generalized representation
if and only if $\beta_{n+1} = \sum_{k \le n} c_k \beta_k$ for all n. (Consequently, the mixed-radix systems
with integer bases have this property; and mixed-radix systems with $\beta_1 = (c_0 + 1)\beta_0$,
$\beta_2 = (c_1 + 1)(c_0 + 1)\beta_0$, ..., $\beta_{-1} = \beta_0/(c_{-1} + 1)$, ... are the most general number
systems of this type.)

27. [M21] Show that every nonzero integer has a unique “reversing binary representation”

    $2^{e_0} - 2^{e_1} + \cdots + (-1)^t 2^{e_t},$

where $e_0 < e_1 < \cdots < e_t$.

► 28. [M24] Show that every nonzero complex number of the form $a + bi$ where a and
b are integers has a unique “revolving binary representation”

    $(1 + i)^{e_0} + i(1 + i)^{e_1} - (1 + i)^{e_2} - i(1 + i)^{e_3} + \cdots + i^t (1 + i)^{e_t},$

where $e_0 < e_1 < \cdots < e_t$. (Cf. exercise 27.)

29. [M35] (N. G. de Bruijn.) Let $S_0, S_1, S_2, \ldots$ be sets of nonnegative integers;
we will say that the collection $\{S_0, S_1, S_2, \ldots\}$ has Property B if every nonnegative
integer n can be written in the form

    $n = s_0 + s_1 + s_2 + \cdots, \qquad s_j \in S_j,$

in exactly one way. (Property B implies that $0 \in S_j$ for all j, since $n = 0$ can only
be represented as $0 + 0 + 0 + \cdots$.) Any mixed-radix number system with radices
$b_0, b_1, b_2, \ldots$ provides an example of a collection of sets satisfying Property B, if we
let $S_j = \{0, B_j, \ldots, (b_j - 1)B_j\}$, where $B_j = b_0 b_1 \ldots b_{j-1}$; here the representation of
$n = s_0 + s_1 + s_2 + \cdots$ corresponds in an obvious manner to its mixed-radix
representation (9). Furthermore, if the collection $\{S_0, S_1, S_2, \ldots\}$ has Property B, and if
$A_0, A_1, A_2, \ldots$ is any partition of the nonnegative integers (so that we have $A_0 \cup A_1 \cup
A_2 \cup \cdots = \{0, 1, 2, \ldots\}$ and $A_i \cap A_j = \emptyset$ for $i \ne j$; some $A_j$’s may be empty), then
the “collapsed” collection $\{T_0, T_1, T_2, \ldots\}$ also has Property B, where $T_j$ is the set of
all sums $\sum_{i \in A_j} s_i$ taken over all possible choices of $s_i \in S_i$.

Prove that any collection $\{T_0, T_1, T_2, \ldots\}$ that satisfies Property B may be obtained
by collapsing some collection $\{S_0, S_1, S_2, \ldots\}$ that corresponds to a mixed-radix number
system.

30. [M39] (N. G. de Bruijn.) The radix $(-2)$ number system shows us that every
integer (positive, negative, or zero) has a unique representation of the form

    $(-2)^{e_1} + (-2)^{e_2} + \cdots + (-2)^{e_t}, \qquad e_1 > e_2 > \cdots > e_t \ge 0, \quad t \ge 0.$

The purpose of this exercise is to explore generalizations of this phenomenon.

a) Let $b_0, b_1, b_2, \ldots$ be a sequence of integers such that every integer n has a unique
representation of the form

    $n = b_{e_1} + b_{e_2} + \cdots + b_{e_t}, \qquad e_1 > e_2 > \cdots > e_t \ge 0, \quad t \ge 0.$

(Such a sequence $\langle b_n \rangle$ is called a “binary basis.”) Show that there is an index j
such that $b_j$ is odd, but $b_k$ is even for all $k \ne j$.

b) Prove that a binary basis $\langle b_n \rangle$ can always be rearranged into the form $d_0, 2d_1, 4d_2,
\ldots = \langle 2^n d_n \rangle$, where each $d_k$ is odd.

c) If each of $d_0, d_1, d_2, \ldots$ in (b) is $\pm 1$, prove that $\langle b_n \rangle$ is a binary basis if and only
if there are infinitely many $+1$’s and infinitely many $-1$’s.

d) Prove that $7, -13 \cdot 2, 7 \cdot 2^2, -13 \cdot 2^3, \ldots, 7 \cdot 2^{2k}, -13 \cdot 2^{2k+1}, \ldots$ is a binary
basis, and find the representation of $n = 1$.

► 31. [M35] A generalization of two’s complement arithmetic, called “2-adic numbers,”
was invented about 1900 by K. Hensel. (In fact he treated p-adic numbers, for any
prime p.) A 2-adic number may be regarded as a binary number

    $u = (\ldots u_3 u_2 u_1 u_0 . u_{-1} \ldots u_{-n})_2,$

whose representation extends infinitely far to the left, but only finitely many places
to the right, of the binary point. Addition, subtraction, and multiplication of 2-adic
numbers are done according to the ordinary procedures of arithmetic, which can in
principle be extended indefinitely to the left. For example,

    $7 = (\ldots 000000000000111)_2$            $\tfrac17 = (\ldots 110110110110111)_2$
    $-7 = (\ldots 111111111111001)_2$           $-\tfrac17 = (\ldots 001001001001001)_2$
    $\tfrac74 = (\ldots 000000000000001.11)_2$   $\tfrac1{10} = (\ldots 110011001100110.1)_2$
    $\sqrt{-7} = (\ldots 100000010110101)_2$ or $(\ldots 011111101001011)_2$.

Here 7 appears as the ordinary binary integer seven, while $-7$ is its two’s complement
(extending infinitely to the left); it is easy to verify that the ordinary procedure for
addition of binary numbers will give $-7 + 7 = (\ldots 00000)_2 = 0$, when the procedure
is continued indefinitely. The values of $\tfrac17$ and $-\tfrac17$ are the unique 2-adic numbers that,
when formally multiplied by 7, give 1 and $-1$, respectively. The values of $\tfrac74$ and $\tfrac1{10}$
are examples of 2-adic numbers that are not 2-adic “integers,” since they have nonzero
bits to the right of the binary point. The two values of $\sqrt{-7}$, which are negatives of
each other, are the only 2-adic numbers that, when formally squared, yield the value
$(\ldots 111111111111001)_2$.

a) Prove that any 2-adic number u can be divided by any nonzero 2-adic number v
to obtain a unique 2-adic number w satisfying $u = vw$. (Hence the set of 2-adic
numbers forms a “field”; cf. Section 4.6.1.)

b) Prove that the 2-adic representation of the rational number $-1/(2n + 1)$ may be
obtained as follows, when n is a positive integer: First find the ordinary binary
expansion of $+1/(2n + 1)$, which has the periodic form $(0.\alpha\alpha\alpha\ldots)_2$ for some
string $\alpha$ of 0’s and 1’s. Then $-1/(2n + 1)$ is the 2-adic number $(\ldots\alpha\alpha\alpha)_2$.

c) Prove that the representation of a 2-adic number u is ultimately periodic (that is,
$u_{N+\lambda} = u_N$ for all large N, for some $\lambda \ge 1$) if and only if u is rational (that is,
$u = m/n$, for some integers m and n).

d) Prove that, when n is an integer, $\sqrt{n}$ is a 2-adic number if and only if it satisfies
$n \bmod 2^{2k+3} = 2^{2k}$ for some nonnegative integer k. (Thus, the possibilities are
either $n \bmod 8 = 1$, or $n \bmod 32 = 4$, etc.)

32. [M40] (I. Z. Ruzsa.) Prove that there are infinitely many integers whose ternary
representation uses only 0’s and 1’s and whose quinary representation uses only 0’s,
1’s, and 2’s.

33. [M40] (D. A. Klarner.) Let D be any set of integers, let b be any positive integer,
and let $k_n$ be the number of distinct integers that can be written as n-digit numbers
$(a_{n-1} \ldots a_1 a_0)_b$ to base b with digits $a_i$ in D. Prove that the sequence $\langle k_n \rangle$ satisfies
a linear recurrence relation, and explain how to compute the generating function
$\sum_n k_n z^n$. Illustrate your algorithm in the case $b = 3$ and $D = \{-1, 0, 3\}$.

4.2. FLOATING POINT ARITHMETIC

In this section, we shall study the basic principles of doing arithmetic on 
“floating point” numbers, by analyzing the internal mechanisms underlying such 
calculations. Perhaps many readers will have little interest in this subject, since 
their computers either have built-in floating point instructions or their computer 
manufacturer has supplied suitable subroutines. But, in fact, the material of 
this section should not merely be the concern of computer-design engineers or of 
a small clique of people who write library subroutines for new machines; every 
well-rounded programmer ought to have a knowledge of what goes on during the 
elementary steps of floating point arithmetic. This subject is not at all as trivial 
as most people think; it involves a surprising amount of interesting information. 


4.2.1. Single-Precision Calculations 

A. Floating point notation. We have discussed “fixed point” notation for numbers 
in Section 4.1; in such a case the programmer knows where the radix point 
is assumed to lie in the numbers he manipulates. For many purposes it is 
considerably more convenient to let the position of the radix point be dynamically 
variable or “floating” as a program is running, and to carry with each number an 
indication of its current radix point position. This idea has been used for many 
years in scientific calculations, especially for expressing very large numbers like 
Avogadro’s number $N = 6.02252 \times 10^{23}$, or very small numbers like Planck’s
constant $h = 1.0545 \times 10^{-27}$ erg sec.

In this section we shall work with base b, excess q, floating point numbers
with p digits: Such numbers will be represented by pairs of values $(e, f)$, denoting

    $(e, f) = f \times b^{e-q}.$    (1)

Here e is an integer having a specified range, and f is a signed fraction. We will
adopt the convention that

    $|f| < 1;$

in other words, the radix point appears at the left of the positional representation
of f. More precisely, the stipulation that we have p-digit numbers means that
$b^p f$ is an integer, and that

    $-b^p < b^p f < b^p.$    (2)

The term “floating binary” implies that $b = 2$, “floating decimal” implies $b = 10$,
etc. Using excess-50 floating decimal numbers with 8 digits, we can write,
for example,

    Avogadro’s number N = (74, +.60225200);
    Planck’s constant h = (24, +.10545000).    (3)
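As a quick check of definitions (1) and (2) and of the two examples above, the pairs can be decoded with exact rational arithmetic. The following is a minimal Python sketch; the helper names `fp_value` and `is_p_digit` are illustrative assumptions and do not come from the text.

```python
from fractions import Fraction

def fp_value(e, f, b=10, q=50):
    """Value denoted by the pair (e, f) under Eq. (1): f * b**(e - q)."""
    return Fraction(f) * Fraction(b) ** (e - q)

def is_p_digit(f, b=10, p=8):
    """Condition (2): b**p * f is an integer strictly between -b**p and b**p."""
    scaled = Fraction(f) * b**p
    return scaled.denominator == 1 and abs(scaled) < b**p

# Excess-50, 8-digit floating decimal pairs from the text (names are hypothetical).
avogadro = (74, Fraction(60225200, 10**8))
planck = (24, Fraction(10545000, 10**8))

print(float(fp_value(*avogadro)))   # about 6.02252e23
print(float(fp_value(*planck)))     # about 1.0545e-27
```

Using `Fraction` keeps every value exact, so the comparison against $6.02252 \times 10^{23}$ involves no rounding of its own.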

The two components e and f of a floating point number are called the
exponent and the fraction parts, respectively. (Other names are occasionally
used for this purpose, notably “characteristic” and “mantissa”; but it is an abuse
of terminology to call the fraction part a mantissa, since this concept has quite a
different meaning in connection with logarithms. Furthermore the English word
mantissa means “a worthless addition.”)

The MIX computer assumes that its floating point numbers have the form

    | ± | e | f | f | f | f |.    (4)

Here we have base b, excess q, floating point notation with four bytes of precision,
where b is the byte size (e.g., $b = 64$ or $b = 100$), and q is equal to $\lfloor \tfrac12 b \rfloor$. The
fraction part is ± f f f f, and e is the exponent, which lies in the range $0 \le
e < b$. This internal representation is typical of the conventions in most existing
computers, although b is a much larger base than usual.

B. Normalized calculations. A floating point number $(e, f)$ is normalized if the
most significant digit of the representation of f is nonzero, so that

    $1/b \le |f| < 1;$    (5)

or if $f = 0$ and e has its smallest possible value. It is possible to tell which of
two normalized floating point numbers has a greater magnitude by comparing
the exponent parts first, and then testing the fraction parts only if the exponents
are equal.
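The normalization condition (5) and the exponent-first comparison can be sketched as follows in Python. The function names and the default `e_min=0` are assumptions made for illustration; the tuple ordering simply mirrors "compare exponents, then fractions."

```python
from fractions import Fraction

def is_normalized(e, f, b=10, e_min=0):
    """Condition (5): 1/b <= |f| < 1, or f = 0 with the smallest exponent."""
    f = Fraction(f)
    if f == 0:
        return e == e_min
    return Fraction(1, b) <= abs(f) < 1

def magnitude_key(e, f):
    """Sort key for normalized numbers: exponent first, then |fraction|."""
    return (e, abs(Fraction(f)))

def larger_magnitude(u, v):
    """Return whichever normalized pair has the greater magnitude."""
    return u if magnitude_key(*u) >= magnitude_key(*v) else v
```

The comparison is valid only for normalized inputs; an unnormalized pair such as (50, .00000001) would be ranked above (49, .99999999) even though its value is smaller.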

Most floating point routines now in use deal almost entirely with normalized 
numbers: inputs to the routines are assumed to be normalized, and the outputs 
are always normalized. Under these conventions we lose the ability to represent 
a few numbers of very small magnitude—for example, the value (0, .00000001) 
can’t be normalized without producing a negative exponent—but we gain in 
speed, uniformity, and the ability to give relatively simple bounds on the relative 
error in our computations. (Unnormalized floating point arithmetic is discussed 
in Section 4.2.2.) 

Let us now study the normalized floating point operations in detail. At the 
same time we can consider the construction of subroutines for these operations, 
assuming that we have a computer without built-in floating point hardware. 

Machine-language subroutines for floating point arithmetic are usually written
in a very machine-dependent manner, using many of the wildest idiosyncrasies
of the computer at hand; so floating point addition subroutines for two
different machines usually bear little superficial resemblance to each other. Yet
a careful study of numerous subroutines for both binary and decimal computers 
reveals that these programs actually have quite a lot in common, and it is possible 
to discuss the topics in a machine-independent way. 

The first (and by far the most difficult!) algorithm we shall discuss in this
section is a procedure for floating point addition,

    $(e_u, f_u) \oplus (e_v, f_v) = (e_w, f_w).$    (6)

Note: Since floating point arithmetic is inherently approximate, not exact, we
will use “round” symbols

    $\oplus, \quad \ominus, \quad \otimes, \quad \oslash$

to stand for floating point addition, subtraction, multiplication, and division,
respectively, in order to distinguish approximate operations from the true ones.

The basic idea involved in floating point addition is fairly simple: Assuming
that $e_u \ge e_v$, we take $e_w = e_u$, $f_w = f_u + f_v/b^{e_u - e_v}$ (thereby aligning the radix
points for a meaningful addition), and normalize the result. Several situations
can arise that make this process nontrivial, and the following algorithm explains
the method more precisely.

Algorithm A (Floating point addition). Given base b, excess q, p-digit, normalized
floating point numbers $u = (e_u, f_u)$ and $v = (e_v, f_v)$, this algorithm forms the
sum $w = u \oplus v$. The same procedure may be used for floating point subtraction,
if $-v$ is substituted for v.

A1. [Unpack.] Separate the exponent and fraction parts of the representations
of u and v.

A2. [Assume $e_u \ge e_v$.] If $e_u < e_v$, interchange u and v. (In many cases, it is
best to combine step A2 with step A1 or with some of the later steps.)

A3. [Set $e_w$.] Set $e_w \leftarrow e_u$.

A4. [Test $e_u - e_v$.] If $e_u - e_v \ge p + 2$ (large difference in exponents), set $f_w \leftarrow f_u$
and go to step A7. (Actually, since we are assuming that u is normalized,
we could terminate the algorithm; but it is occasionally useful to be able to
normalize a possibly unnormalized number by adding zero to it.)

A5. [Scale right.] Shift $f_v$ to the right $e_u - e_v$ places; i.e., divide it by $b^{e_u - e_v}$.
[Note: This will be a shift of up to $p + 1$ places, and the next step (which
adds $f_u$ to $f_v$) thereby requires an accumulator capable of holding $2p + 1$
base-b digits to the right of the radix point. If such a large accumulator is
not available, it is possible to shorten the requirement to $p + 2$ or $p + 3$
places if proper precautions are taken; the details are given in exercise 5.]

A6. [Add.] Set $f_w \leftarrow f_u + f_v$.

A7. [Normalize.] (At this point $(e_w, f_w)$ represents the sum of u and v, but $|f_w|$
may have more than p digits, and it may be greater than unity or less than
$1/b$.) Perform Algorithm N below, to normalize and round $(e_w, f_w)$ into the
final answer. |

[Fig. 3. Normalization of $(e, f)$.]

Algorithm N (Normalization). A “raw exponent” e and a “raw fraction” f are
converted to normalized form, rounding if necessary to p digits. This algorithm
assumes that $|f| < b$.

N1. [Test f.] If $|f| \ge 1$ (“fraction overflow”), go to step N4. If $f = 0$, set e to
its lowest possible value and go to step N7.

N2. [Is f normalized?] If $|f| \ge 1/b$, go to step N5.

N3. [Scale left.] Shift f to the left by one digit position (i.e., multiply it by b),
and decrease e by 1. Return to step N2.

N4. [Scale right.] Shift f to the right by one digit position (i.e., divide it by b),
and increase e by 1.

N5. [Round.] Round f to p places. (We take this to mean that f is changed
to the nearest multiple of $b^{-p}$. It is possible that $(b^p f) \bmod 1 = \tfrac12$ so that
there are two nearest multiples; if b is even, we choose the one that makes
$b^p f + \tfrac12 b$ odd. Further discussion of rounding appears in Section 4.2.2.) It is
important to note that this rounding operation can make $|f| = 1$ (“rounding
overflow”); in such a case, return to step N4.

N6. [Check e.] If e is too large, i.e., larger than its allowed range, an exponent
overflow condition is sensed. If e is too small, an exponent underflow
condition is sensed. (See the discussion below; since the result cannot
be expressed as a normalized floating point number in the required range,
special action is necessary.)

N7. [Pack.] Put e and f together into the desired output representation. |

Some simple examples of floating point addition are given in exercise 4. 
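Algorithms A and N can also be traced in a high-level language. The following Python sketch is one possible rendering under stated assumptions: fractions are held as exact `Fraction` values rather than machine words, the exponent range `e_min..e_max` defaults to 0..99 for the excess-50 decimal examples, and exponent overflow and underflow are reported by raising exceptions. The function names are hypothetical.

```python
from fractions import Fraction

def round_to_p(f, b, p):
    """Step N5: nearest multiple of b**-p; a tie goes to the multiple
    that makes b**p * f + b/2 odd (b is assumed even)."""
    scaled = f * b**p
    n = scaled.numerator // scaled.denominator    # floor of b**p * f
    rem = scaled - n                              # 0 <= rem < 1
    half = Fraction(1, 2)
    if rem > half or (rem == half and (n + b // 2) % 2 == 0):
        n += 1
    return Fraction(n, b**p)

def normalize(e, f, b=10, p=8, e_min=0, e_max=99):
    """Algorithm N for a raw exponent e and raw fraction f with |f| < b."""
    f = Fraction(f)
    if f == 0:                                    # N1: zero takes the lowest exponent
        return (e_min, f)
    while abs(f) < Fraction(1, b):                # N2, N3: scale left
        f, e = f * b, e - 1
    while True:
        if abs(f) >= 1:                           # N4: scale right
            f, e = f / b, e + 1
        f = round_to_p(f, b, p)                   # N5: round
        if abs(f) < 1:                            # |f| = 1 means rounding overflow
            break
    if e > e_max:                                 # N6: check e
        raise OverflowError("exponent overflow")
    if e < e_min:
        raise FloatingPointError("exponent underflow")
    return (e, f)                                 # N7: "pack" as a pair

def fadd(u, v, b=10, p=8, e_min=0, e_max=99):
    """Algorithm A: floating point sum of normalized pairs u and v."""
    (eu, fu), (ev, fv) = u, v                     # A1: unpack
    if eu < ev:                                   # A2: ensure e_u >= e_v
        (eu, fu), (ev, fv) = (ev, fv), (eu, fu)
    ew = eu                                       # A3
    if eu - ev >= p + 2:                          # A4: v is too small to matter
        fw = Fraction(fu)
    else:                                         # A5, A6: scale right and add
        fw = Fraction(fu) + Fraction(fv) / b**(eu - ev)
    return normalize(ew, fw, b, p, e_min, e_max)  # A7
```

For instance, `fadd((50, Fraction(99999999, 10**8)), (42, Fraction(1, 2)))` produces the exact sum .999999995, which ties in step N5, rounds up to 1 (rounding overflow), and is returned as `(51, Fraction(1, 10))`. Because the sketch adds exactly, it sidesteps the accumulator-width question raised in step A5.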

The following MIX subroutines, for addition and subtraction of numbers 
having the form (4), show how Algorithms A and N can be expressed as computer 
programs. The subroutines below are designed to take one input u from symbolic 
location ACC, and the other input v comes from register A upon entrance to the 
subroutine. The output w appears both in register A and location ACC. Thus, a 
fixed point coding sequence 

LDA A; ADD B; SUB C; STA D (7) 

would correspond to the floating point coding sequence 

LDA A, STA ACC; LDA B, JMP FADD; LDA C, JMP FSUB; STA D. (8) 


Program A (Addition, subtraction, and normalization). The following program
is a subroutine for Algorithm A, and it is also designed so that the normalization
portion can be used by other subroutines that appear later in this section. In
this program and in many others throughout this chapter, OFLO stands for a
subroutine that prints out a message to the effect that MIX’s overflow toggle was
unexpectedly found to be “on.” The byte size b is assumed to be a multiple
of 4. The normalization routine NORM assumes that rI2 = e and rAX = f, where
rA = 0 implies rX = 0 and rI2 < b.


00  BYTE   EQU   1(4:4)           Byte size b
01  EXP    EQU   1:1              Definition of exponent field
02  FSUB   STA   TEMP             Floating point subtraction subroutine:
03         LDAN  TEMP             Change sign of operand.
04  FADD   STJ   EXITF            Floating point addition subroutine:
05         JOV   OFLO             Ensure overflow is off.
06         STA   TEMP             TEMP ← v.
07         LDX   ACC              rX ← u.
08         CMPA  ACC(EXP)         Steps A1, A2, A3 are combined here:
09         JGE   1F               Jump if e_v ≥ e_u.
10         STX   FU(0:4)          FU ← ± f f f f 0.
11         LD2   ACC(EXP)         rI2 ← e_w.
12         STA   FV(0:4)
13         LD1N  TEMP(EXP)        rI1 ← −e_v.
14         JMP   4F
15  1H     STA   FU(0:4)          FU ← ± f f f f 0 (u, v interchanged).
16         LD2   TEMP(EXP)        rI2 ← e_w.
17         STX   FV(0:4)
18         LD1N  ACC(EXP)         rI1 ← −e_v.
19  4H     INC1  0,2              rI1 ← e_u − e_v. (Step A4 unnecessary.)
20  5H     LDA   FV               A5. Scale right.
21         ENTX  0                Clear rX.
22         SRAX  0,1              Shift right e_u − e_v places.
23  6H     ADD   FU               A6. Add.
24         JOV   N4               A7. Normalize. Jump if fraction overflow.
25         JXZ   NORM             Easy case?
26         CMPA  =0=(1:1)         Is f normalized?
27         JNE   N5               If so, round it.
28         SRC   5                |rX| ↔ |rA|.
29         DECX  1                (rX is positive.)
30         STA   TEMP             (Operands had opposite signs,
31         STA   HALF(0:0)          registers must be adjusted
32         LDAN  TEMP               before rounding and normalization.)
33         ADD   HALF             Complement least significant portion.
34         ADD   HALF
35         SRC   4
36         JMP   N3A              Jump into normalization routine.
37  HALF   CON   1//2             One half the word size (Sign varies)
38  FU     CON   0                Fraction part f_u
39  FV     CON   0                Fraction part f_v
40  NORM   JAZ   ZRO              N1. Test f.
41  N2     CMPA  =0=(1:1)         N2. Is f normalized?
42         JNE   N5               To N5 if leading byte nonzero.
43  N3     SLAX  1                N3. Scale left.
44  N3A    DEC2  1                Decrease e by 1.
45         JMP   N2               Return to N2.
46  N4     ENTX  1                N4. Scale right.
47         SRC   1                Shift right, insert “1” with proper sign.
48         INC2  1                Increase e by 1.
49  N5     CMPA  =BYTE/2=(5:5)    N5. Round.
50         JL    N6               Is |tail| < ½b?
51         JG    5F               Is |tail| > ½b?
52         JXNZ  5F
53         STA   TEMP             |tail| = ½b; round to odd.
54         LDX   TEMP(4:4)
55         JXO   N6               To N6 if rX is odd.
56  5H     STA   *+1(0:0)         Store sign of rA.
57         INCA  BYTE             Add b^(−4) to |f|. (Sign varies)
58         JOV   N4               Check for rounding overflow.
59  N6     J2N   EXPUN            N6. Check e. Underflow if e < 0.
60  N7     ENTX  0,2              N7. Pack. rX ← e.
61         SRC   1
62  ZRO    DEC2  BYTE             rI2 ← e − b.
63  8H     STA   ACC
64  EXITF  J2N   *                Exit, unless e ≥ b.
65  EXPOV  HLT   2                Exponent overflow detected
66  EXPUN  HLT   1                Exponent underflow detected
67  ACC    CON   0                Floating point accumulator |

The rather long section of code from lines 25 to 37 is needed because MIX has
only a 5-byte accumulator for adding signed numbers while in general $2p + 1 = 9$
places of accuracy are required by Algorithm A. The program could be shortened 
to about half its present length if we were willing to sacrifice a little bit of its 
accuracy, but we shall see in the next section that full accuracy is important. 
Line 55 uses a nonstandard MIX instruction defined in Section 4.5.2. The running 
time for floating point addition and subtraction depends on several factors that 
are analyzed in Section 4.2.4. 

Now let us consider multiplication and division, which are simpler than 
addition, and which are somewhat similar to each other. 

Algorithm M (Floating point multiplication or division). Given base b, excess q,
p-digit, normalized floating point numbers $u = (e_u, f_u)$ and $v = (e_v, f_v)$, this
algorithm forms the product $w = u \otimes v$ or the quotient $w = u \oslash v$.

M1. [Unpack.] Separate the exponent and fraction parts of the representations of
u and v. (Sometimes it is convenient, but not necessary, to test the operands
for zero during this step.)

M2. [Operate.] Set

    $e_w \leftarrow e_u + e_v - q, \qquad f_w \leftarrow f_u f_v$    for multiplication;
    $e_w \leftarrow e_u - e_v + q + 1, \qquad f_w \leftarrow (b^{-1} f_u)/f_v$    for division.

(Since the input numbers are assumed to be normalized, it follows that either
$f_w = 0$, or $1/b^2 \le |f_w| < 1$, or a division-by-zero error has occurred.) If
necessary, the representation of $f_w$ may be reduced to $p + 2$ or $p + 3$ digits
at this point, as in exercise 5.

M3. [Normalize.] Perform Algorithm N on $(e_w, f_w)$ to normalize, round, and
pack the result. (Note: Normalization is simpler in this case, since scaling
left occurs at most once, and since rounding overflow cannot occur after
division.) |
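Step M2 can be checked numerically with a small Python sketch. It assumes exact `Fraction` arithmetic, an exponent range of 0..99 with excess 50, and a compact normalizer written here only as a stand-in for Algorithm N; the names `fmul`, `fdiv`, and `normalize` are hypothetical.

```python
from fractions import Fraction

def normalize(e, f, b, p, e_min, e_max):
    """Compact stand-in for Algorithm N (same tie rule as step N5)."""
    if f == 0:
        return (e_min, Fraction(0))
    while abs(f) < Fraction(1, b):            # scale left
        f, e = f * b, e - 1
    while True:
        if abs(f) >= 1:                       # scale right
            f, e = f / b, e + 1
        scaled = f * b**p
        n = scaled.numerator // scaled.denominator
        rem = scaled - n
        half = Fraction(1, 2)
        if rem > half or (rem == half and (n + b // 2) % 2 == 0):
            n += 1                            # round up, or break the tie
        f = Fraction(n, b**p)
        if abs(f) < 1:
            break
    if e > e_max:
        raise OverflowError("exponent overflow")
    if e < e_min:
        raise FloatingPointError("exponent underflow")
    return (e, f)

def fmul(u, v, b=10, p=8, q=50, e_min=0, e_max=99):
    (eu, fu), (ev, fv) = u, v                             # M1
    ew, fw = eu + ev - q, Fraction(fu) * Fraction(fv)     # M2 (multiplication)
    return normalize(ew, fw, b, p, e_min, e_max)          # M3

def fdiv(u, v, b=10, p=8, q=50, e_min=0, e_max=99):
    (eu, fu), (ev, fv) = u, v                             # M1
    if fv == 0:
        raise ZeroDivisionError("floating point division by zero")
    ew = eu - ev + q + 1                                  # M2 (division)
    fw = (Fraction(fu) / b) / Fraction(fv)
    return normalize(ew, fw, b, p, e_min, e_max)          # M3
```

With u = (51, .20000000) and v = (51, .30000000), that is, the numbers 2 and 3, `fmul` yields (51, .60000000) and `fdiv` yields (50, .66666667). The extra factor $b^{-1}$ in the division keeps the raw fraction below 1, which is why the exponent formula carries the compensating $+1$.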

The following MIX subroutines, which are intended to be used in connection 
with Program A, illustrate the machine considerations necessary in connection 
with Algorithm M. 

Program M (Floating point multiplication and division). 


01  Q      EQU   BYTE/2           q is half the byte size
02  FMUL   STJ   EXITF            Floating point multiplication subroutine:
03         JOV   OFLO             Ensure overflow is off.
04         STA   TEMP             TEMP ← v.
05         LDX   ACC              rX ← u.
06         STX   FU(0:4)          FU ← ± f f f f 0.
07         LD1   TEMP(EXP)
08         LD2   ACC(EXP)
09         INC2  -Q,1             rI2 ← e_u + e_v − q.
10         SLA   1
11         MUL   FU               Multiply f_u times f_v.
12         JMP   NORM             Normalize, round, and exit.
13  FDIV   STJ   EXITF            Floating point division subroutine:
14         JOV   OFLO             Ensure overflow is off.
15         STA   TEMP             TEMP ← v.
16         STA   FV(0:4)          FV ← ± f f f f 0.
17         LD1   TEMP(EXP)
18         LD2   ACC(EXP)
19         DEC2  -Q,1             rI2 ← e_u − e_v + q.
20         ENTX  0
21         LDA   ACC
22         SLA   1                rA ← f_u.
23         CMPA  FV(1:5)
24         JL    *+3              Jump if |f_u| < |f_v|.
25         SRA   1                Otherwise, scale f_u right
26         INC2  1                  and increase rI2 by 1.
27         DIV   FV               Divide.
28         JNOV  NORM             Normalize, round, and exit.
29  DVZRO  HLT   3                Unnormalized or zero divisor |


The most noteworthy feature of this program is the provision for division
in lines 23-26, which is made in order to ensure enough accuracy to round the
answer. If $|f_u| < |f_v|$, straightforward application of Algorithm M would leave
a result of the form “± 0 f f f f” in register A, and this would not allow a
proper rounding without a careful analysis of the remainder (which appears in
register X). So the program computes $f_w \leftarrow f_u/f_v$ in this case, ensuring that $f_w$
is either zero or normalized in all cases; rounding can proceed with five significant
bytes, possibly testing whether the remainder is zero.


We occasionally need to convert values between fixed and floating point 
representations. A “fix-to-float” routine is easily obtained with the help of the 
normalization algorithm above; for example, in MIX, the following subroutine 
converts an integer to floating point form: 


01  FLOT   STJ   EXITF            Assume that rA = u, an integer.
02         JOV   OFLO             Ensure overflow is off.
03         ENT2  Q+5              Set raw exponent.
04         ENTX  0
05         JMP   NORM             Normalize, round, and exit. |        (10)


A “float-to-fix” subroutine is the subject of exercise 14. 

The debugging of floating point subroutines is usually a difficult job, since 
there are so many cases to consider. Here is a list of common pitfalls that often 
trap a programmer or machine designer who is preparing floating point routines: 

1) Losing the sign. On many machines (not MIX), shift instructions between 
registers will affect the sign, and the shifting operations used in normalizing and 
scaling numbers must be carefully analyzed. The sign is also lost frequently when 
minus zero is present. (For example, Program A is careful to retain the sign of 
register A in lines 30-34. See also exercise 6.)

2) Failure to treat exponent underflow or overflow properly. The size of $e_w$
should not be checked until after the rounding and normalization, because
preliminary tests may give an erroneous indication. Exponent underflow and
overflow can occur on floating point addition and subtraction, not only during
multiplication and division; and even though this is a rather rare occurrence, it
must be tested each time. Enough information should be retained so that meaningful
corrective actions are possible after overflow or underflow has occurred.

It has unfortunately become customary in many instances to ignore exponent 
underflow and simply to set underflowed results to zero with no indication of 
error. This causes a serious loss of accuracy in most cases (indeed, it is the 
loss of all the significant digits), and the assumptions underlying floating point 
arithmetic have broken down, so the programmer really must be told when 
underflow has occurred. Setting the result to zero is appropriate only in certain 
cases when the result is later to be added to a significantly larger quantity. 
When exponent underflow is not detected, we find mysterious situations in which
$(u \otimes v) \otimes w$ is zero, but $u \otimes (v \otimes w)$ is not, since $u \otimes v$ results in exponent
underflow but $u \otimes (v \otimes w)$ can be calculated without any exponents falling out
of range. Similarly, we can find positive numbers a, b, c, d, and y such that

    $(a \otimes y \oplus b) \oslash (c \otimes y \oplus d) \approx \tfrac23,$
    $(a \oplus b \oslash y) \oslash (c \oplus d \oslash y) = 1$    (11)

if exponent underflow is not detected. (See exercise 9.) Even though floating
point routines are not precisely accurate, such a disparity as (11) is certainly
unexpected when a, b, c, d, and y are all positive! Exponent underflow is usually
not anticipated by a programmer, so he needs to be told about it.*
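The loss of associativity under undetected underflow is easy to reproduce on present-day hardware. The sketch below uses ordinary Python floats, assuming they are IEEE 754 binary64 values (true on common platforms, but an assumption nonetheless); the particular operands are chosen only for illustration and are not taken from the text.

```python
# Assumed: Python floats are IEEE 754 binary64, whose smallest positive
# value is roughly 4.9e-324, so 1e-400 cannot be represented at all.
u, v, w = 1e-200, 1e-200, 1e200

left = (u * v) * w     # u*v underflows silently to 0.0, so the product is 0.0
right = u * (v * w)    # v*w is close to 1, so no exponent leaves its range

print(left, right)
```

No exception or warning is raised for the underflowed product, which is exactly the silent behavior criticized above: the two groupings differ by all of their significant digits.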

3) Inserted garbage. When scaling to the left it is important to keep from 
introducing anything but zeros at the right. For example, note the “ENTX 0” 
instruction in line 21 of Program A, and the all-too-easily-forgotten “ENTX 0” 
instruction in line 04 of the FLOT subroutine (10). (But it would be a mistake to 
clear register X after line 27 in the division subroutine.) 

*On the other hand, it must be admitted that today’s high-level programming languages give 
the programmer little or no satisfactory way to make use of the information that a floating 
point routine wants to tell him; and the MIX programs in this section, which simply ‘HLT’ 
when errors are detected, are even worse. There are numerous important applications in which 
exponent underflow is relatively harmless, and it is desirable to find a way for programmers
to cope with such situations easily and safely. The practice of silently replacing underflows by 
zero has been thoroughly discredited, but there is another alternative that has recently been 
gaining much favor, namely to modify the definition we have given for floating point numbers, 
allowing an unnormalized fraction part when the exponent has its smallest possible value. This 
idea of “gradual underflow,” which was first embodied in the hardware of the Electrologica X8 
computer, adds only a small amount of complexity to the algorithms, and it makes exponent 
underflow impossible during addition or subtraction. The simple formulas for relative error 
in Section 4.2.2 no longer hold in the presence of gradual underflow, so the topic is beyond 
the scope of this book. However, by using formulas like $\mathrm{round}(x) = x(1 - \delta) + \epsilon$, where $|\delta| < \tfrac12 b^{1-p}$
and $|\epsilon| < \tfrac12 b^{-p-q}$, one can show that gradual underflow succeeds in many important
cases. See W. M. Kahan and J. Palmer, ACM SIGNUM Newsletter (Oct. 1979), 13-21. 

4) Unforeseen rounding overflow. When a number like .999999997 is rounded
to 8 digits, a carry will occur to the left of the decimal point, and the result must
be scaled to the right. Many people have mistakenly concluded that rounding
overflow is impossible during multiplication, since they look at the maximum
value of $|f_u f_v|$, which is $1 - 2b^{-p} + b^{-2p}$; and this cannot round up to 1. The
fallacy in this reasoning is exhibited in exercise 11. Curiously, it turns out that
the phenomenon of rounding overflow is impossible during floating point division
(see exercise 12).
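
Both observations in item 4 can be checked mechanically. This is a sketch under the assumption that Python's `decimal` module with `prec=8` behaves like the eight-digit rounding discussed here; it says nothing about the fallacy itself, which is left to exercise 11.

```python
from decimal import Decimal, Context

ctx = Context(prec=8)

x = Decimal("0.999999997")
r = ctx.plus(x)   # unary plus rounds its operand to the context precision
# The carry runs past the decimal point, so the exponent grows by one.
print(r, x.adjusted(), r.adjusted())

f_max = Decimal("0.99999999")       # largest 8-digit fraction, 1 - b^-p
prod = ctx.multiply(f_max, f_max)   # exact product is 1 - 2b^-p + b^-2p
print(prod)
```

The first result shows the scaling to the right that rounding overflow forces; the second shows that the product of the two largest fractions does not round up to 1.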

There is a school of thought that says it is harmless to “round” a value like 
.999999997 to .99999999 instead of to 1.0000000, since this does not increase 
the worst-case bounds on relative error. The floating point number 1.0000000 
may be said to represent all real values in the interval $[1.0000000 - 5 \times 10^{-8},\ 1.0000000 + 5 \times 10^{-8}]$,
while .99999999 represents all values in the much smaller
interval $(.99999999 - 5 \times 10^{-9},\ .99999999 + 5 \times 10^{-9})$. Even though the
latter interval does not contain the original value .999999997, each number of 
the second interval is contained in the first, so subsequent calculations with the 
second interval are no less accurate than with the first. This ingenious argument 
is, however, incompatible with the mathematical philosophy of floating point 
arithmetic expressed in Section 4.2.2. 

5) Rounding before normalizing. Inaccuracies are caused by premature rounding
in the wrong digit position. This error is obvious when rounding is being done
to the left of the appropriate position; but it is also dangerous in the less obvious 
cases where rounding is first done too far to the right, followed by rounding in the 
true position. For this reason it is a mistake to round during the “scaling-right” 
operation in step A5, except as prescribed in exercise 5. (The special case of 
rounding in step N5, then rounding again after rounding overflow has occurred, 
is harmless, however, because rounding overflow always yields +1.0000000 and 
this is unaffected by the subsequent rounding process.) 

6) Failure to retain enough precision in intermediate calculations. Detailed
analyses of the accuracy of floating point arithmetic, made in the next section, 
suggest strongly that normalizing floating point routines should always deliver 
a properly rounded result to the maximum possible accuracy. There should 
be no exceptions to this dictum, even in cases that occur with extremely low 
probability; the appropriate number of significant digits should be retained 
throughout the computations, as stated in Algorithms A and M. 

C. Floating point hardware. Nearly every large computer intended for scientific 
calculations includes floating point arithmetic as part of its repertoire of built-in 
operations. Unfortunately, the design of such hardware usually includes some 
anomalies that result in dismally poor behavior in certain circumstances, and we 
hope that future computer designers will pay more attention to providing the 
proper behavior than they have in the past. It costs only a little more to build the 
machine right, and considerations in the following section show that substantial 
benefits will be gained. Yesterday’s compromises are no longer appropriate for 
modern machines, based on what we know now. 






The MIX computer, which is being used as an example of a “typical” machine 
in this series of books, has an optional “floating point attachment” (available at 
extra cost) that includes the following seven operations: 

• FADD, FSUB, FMUL, FDIV, FLOT, FCMP (C = 1, 2, 3, 4, 5, 56, respectively; F = 6). 
The contents of rA after the operation “FADD V” are precisely the same as the 
contents of rA after the operations 

STA ACC 
LDA V 
JMP FADD 


where FADD is the subroutine that appears earlier in this section, except that both 
operands are automatically normalized before entry to the subroutine if they are 
not already in normalized form. (If exponent underflow occurs during this
prenormalization, but not during the normalization of the answer, no underflow is
signalled.) Similar remarks apply to FSUB, FMUL, and FDIV. The contents of rA 
after the operation “FLOT” are the contents after “JMP FLOT” in the subroutine 
(10) above. 

The contents of rA are unchanged by the operation “FCMP V”; this
instruction sets the comparison indicator to less, equal, or greater, depending on
whether the contents of rA are “definitely less than,” “approximately equal to,” 
or “definitely greater than” V; this subject is discussed in the next section, and 
the precise action is defined by the subroutine FCMP of exercise 4.2.2-17 with 
EPSILON in location 0. 

No register other than rA is affected by any of the floating point operations. 
If exponent overflow or underflow occurs, the overflow toggle is turned on and 
the exponent of the answer is given modulo the byte size. Division by zero leaves 
undefined garbage in rA. Execution times: 4u, 4u, 9u, 11u, 3u, 4u, respectively.

• FIX (C = 5; F = 7). The contents of rA are replaced by the integer “round(rA)”, 
rounding to the nearest integer as in step N5 of Algorithm N. However, if this 
answer is too large to fit in the register, the overflow toggle is set on and the 
result is undefined. Execution time: 3u.

Sometimes it is helpful to use floating point operators in a nonstandard way. 
For example, if the operation FLOT had not been included as part of MIX’s floating
point attachment, we could easily achieve its effect on 4-byte numbers by writing 


    FLOT  STJ   9F
          SLA   1
          ENTX  Q+4
          SRC   1
          FADD  =0=
    9H    JMP   *                    (12)


This routine is not strictly equivalent to the FLOT operator, since it assumes that 
the 1:1 byte of rA is zero, and it destroys rX. The handling of more general 
situations is a little tricky, because rounding overflow can occur even during a 
FLOT operation. 

Similarly, suppose MIX had a FADD operation but not FIX. If we wanted to
round a number u from floating point form to the nearest fixed point integer, 
and if we knew that the number was nonnegative and would fit in at most three 
bytes, we could write 

    FADD  FUDGE

where location FUDGE contains the constant

    | + | Q+4 | 1 | 0 | 0 | 0 |

the result in rA would be

    | + | Q+4 | 1 |  round(u)  |          (13)
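
The same idea carries over to binary hardware. The sketch below is an analogy, not MIX code: it assumes IEEE 754 binary64 numbers (as used by Python's `float`), where adding $2^{52}$ to a small nonnegative value forces the sum to be rounded to an integer that can then be read from the low-order bits of the word, just as round(u) appears in the low-order bytes of (13). The function name `round_via_fudge` is hypothetical.

```python
import struct

FUDGE = 2.0 ** 52   # plays the role of the constant in location FUDGE

def round_via_fudge(u):
    """Nearest integer to u, read from the low-order bits of u + FUDGE."""
    if not 0.0 <= u < 2.0 ** 51:
        raise ValueError("u must be nonnegative and fit in 51 bits")
    # Reinterpret the 64-bit floating point sum as an unsigned integer.
    bits = struct.unpack("<Q", struct.pack("<d", u + FUDGE))[0]
    return bits & ((1 << 52) - 1)   # keep only the 52-bit fraction field

print(round_via_fudge(1234.6), round_via_fudge(2.5))
```

As with the MIX version, the trick depends on knowing in advance that the operand is nonnegative and small enough; ties are resolved by whatever rounding rule the addition uses.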


D. History and bibliography. The origins of floating point notation can be traced 
back to Babylonian mathematicians (1800 B.C. or earlier), who made extensive 
use of radix-60 floating point arithmetic but did not have a notation for the 
exponents. The appropriate exponent was always somehow “understood” by 
the man doing the calculations. At least one case has been found in which 
the wrong answer was given because addition was performed with improper 
alignment of the operands, but such examples are very rare; see O. Neugebauer,
The Exact Sciences in Antiquity (Princeton, N. J.: Princeton University Press, 
1952), 26-27. Another early contribution to floating point notation is due to 
the Greek mathematician Apollonius (3rd century B.C.), who apparently was 
the first to explain how to simplify multiplication by collecting powers of 10 
separately from their coefficients, at least in simple cases. [For a discussion of 
Apollonius’s method, see Pappus, Mathematical Collections (4th century A.D.).] 
After the Babylonian civilization died out, the first significant uses of floating 
point notation for products and quotients did not emerge until much later, about 
the time logarithms were invented (1600) and shortly afterwards when Oughtred 
invented the slide rule (1630). The modern notation “$x^n$” for exponents was
being introduced at about the same time; separate symbols for x squared, x 
cubed, etc., had been in use before this. 

Floating point arithmetic was incorporated into the design of some of the
earliest computers. It was independently proposed by Leonardo Torres y Quevedo
in Madrid, 1914; by Konrad Zuse in Berlin, 1936; and by George Stibitz in
New Jersey, 1939. Zuse’s machines used a floating binary representation that he
called “semi-logarithmic notation”; he also incorporated conventions for dealing
with special quantities like “∞” and “undefined.” The first American computers
to operate with floating point arithmetic hardware were the Bell Laboratories’
Model V and the Harvard Mark II, both of which were relay calculators designed
in 1944. [See B. Randell, The Origins of Digital Computers (Berlin: Springer,
1973), 100, 155, 163-164, 259-260; Proc. Symp. Large-Scale Digital Calculating
Machinery (Harvard, 1947), 41-68, 69-79; Datamation 13 (April 1967), 35-44,
(May 1967), 45-49; Zeit. für angew. Math. und Physik 1 (1950), 345-346.]

The use of floating binary arithmetic was seriously considered in 1944-1946
by researchers at the Moore School in their plans for the first electronic digital 
computers, but it turned out to be much harder to implement floating point 
circuitry with tubes than with relays. The group realized that scaling was a 
problem in programming; but at the time it was only a very small part of 
a total programming job, and it seemed to be worth the time and trouble it 
took, since it tended to keep a programmer aware of the numerical accuracy 
he was getting. Furthermore, they argued that floating point representation 
would take up valuable memory space, since the exponents must be stored, and 
that it would be difficult to adapt floating point arithmetic to multiple-precision 
calculations. [See von Neumann’s Collected Works 5 (New York: Macmillan, 
1963), 43, 73-74.] At this time, of course, they were designing the first
stored-program computer and the second electronic computer, and their choice had to
be either fixed point or floating point arithmetic, not both. They anticipated 
the coding of floating binary routines, and in fact “shift left” and “shift right” 
instructions were put into their machine primarily to make such routines more 
efficient. The first machine to have both kinds of arithmetic in its hardware 
was apparently a computer developed at General Electric Company [see Proc. 
2nd Symp. Large-Scale Digital Calculating Machinery (Cambridge: Harvard 
University Press, 1951), 65-69]. 

Floating point subroutines and interpretive systems for early machines were 
coded by D. J. Wheeler and others, and the first publication of such routines 
was in The Preparation of Programs for an Electronic Digital Computer by 
Wilkes, Wheeler, and Gill (Reading, Mass.: Addison-Wesley, 1951), subroutines 
A1-A11, pp. 35-37, 105-117. It is interesting to note that floating decimal
subroutines are described here, although a binary computer was being used; in
other words, the numbers were represented as $10^e f$, not $2^e f$, and therefore the
scaling operations required multiplication or division by 10. On this particular 
machine such decimal scaling was about as easy as shifting, and the decimal 
approach greatly simplified input/output conversions. 

Most published references to the details of floating point arithmetic routines
are scattered in “technical memorandums” distributed by various computer
manufacturers, but there have been occasional appearances of these routines in the
open literature. Besides the reference above, the following are of historical
interest: R. H. Stark and D. B. MacMillan, Math. Comp. 5 (1951), 86-92, where
a plugboard-wired program is described; D. McCracken, Digital Computer
Programming (New York: Wiley, 1957), 121-131; J. W. Carr III, CACM 2, 5 (May
1959), 10-15; W. G. Wadey, JACM 7 (1960), 129-139; D. E. Knuth, JACM 8
(1961), 119-128; O. Kesner, CACM 5 (1962), 269-271; F. P. Brooks and K. E.
Iverson, Automatic Data Processing (New York: Wiley, 1963), 184-199. For a
discussion of floating point arithmetic from a computer designer’s standpoint, see
“Floating point operation” by S. G. Campbell, in Planning a Computer System,
ed. by W. Buchholz (New York: McGraw-Hill, 1962), 92-121. A set of algorithms
by J. Coonen, W. M. Kahan, and H. S. Stone, submitted to the IEEE
Microprocessor Floating-Point Standards Committee during 1978-1980, represented
the state of the floating point art as of 1980; these carefully considered procedures
will probably be published some day. Additional references, which deal primarily
with the accuracy of floating point methods, are given in Section 4.2.2.


EXERCISES 


1. [10] How would Avogadro’s number and Planck’s constant be represented in base 
100, excess 50, four-digit floating point notation? (This would be the representation 
used by MIX, as in (4), if the byte size is 100.) 

2. [12] Assume that the exponent $e$ is constrained to lie in the range $0 \le e \le E$;
what are the largest and smallest positive values that can be written as base $b$, excess $q$,
$p$-digit floating point numbers? What are the largest and smallest positive values that
can be written as normalized floating point numbers with these specifications?

3. [11] (K. Zuse, 1936.) Show that if we are using normalized floating binary 
arithmetic, there is a way to increase the precision slightly without loss of memory 
space: A $p$-bit fraction part can be represented using only $p - 1$ bit positions of a
computer word, if the range of exponent values is decreased very slightly. 

► 4. [12] Assume that $b = 10$, $p = 8$. What result does Algorithm A give for
$(50, +.98765432) \oplus (49, +.33333333)$? For $(53, -.99987654) \oplus (54, +.10000000)$? For
$(45, -.50000001) \oplus (54, +.10000000)$?

► 5. [24] Let us say that $x \sim y$ (with respect to a given radix $b$) if $x$ and $y$ are real
numbers satisfying the following conditions:

$$\lfloor x/b \rfloor = \lfloor y/b \rfloor;$$
$$x \bmod b = 0 \quad\text{iff}\quad y \bmod b = 0;$$
$$0 < x \bmod b < \tfrac12 b \quad\text{iff}\quad 0 < y \bmod b < \tfrac12 b;$$
$$x \bmod b = \tfrac12 b \quad\text{iff}\quad y \bmod b = \tfrac12 b;$$
$$\tfrac12 b < x \bmod b < b \quad\text{iff}\quad \tfrac12 b < y \bmod b < b.$$

Prove that if $f_v$ is replaced by $b^{-p-2} F_v$ between steps A5 and A6 of Algorithm A,
where $F_v \sim b^{p+2} f_v$, the result of that algorithm will be unchanged. (If $F_v$ is an
integer and $b$ is even, this operation essentially truncates $f_v$ to $p + 2$ places while
remembering whether any nonzero digits have been dropped, thereby minimizing the
length of register that is needed for the addition in step A6.)

6. [20] If the result of a FADD instruction is zero, what will be the sign of rA, according 
to the definitions of MIX’s floating point attachment given in this section?

7. [21] Discuss floating point arithmetic using balanced ternary notation. 

8. [20] Give examples of normalized eight-digit floating decimal numbers u and v 
for which addition yields (a) exponent underflow, (b) exponent overflow, assuming 
that exponents must satisfy $0 \le e < 100$.

9. [M24] (W. M. Kahan.) Assume that the occurrence of exponent underflow causes 
the result to be replaced by zero, with no error indication given. Using excess zero, 
eight-digit floating decimal numbers with $e$ in the range $-50 \le e < 50$, find positive
values of $a$, $b$, $c$, $d$, and $y$ such that (11) holds.

10. [12] Give an example of normalized eight-digit floating decimal numbers u and v 
for which rounding overflow occurs in addition. 

► 11. [M20] Give an example of normalized, excess 50, eight-digit floating decimal
numbers u and v for which rounding overflow occurs in multiplication. 

12. [M25] Prove that rounding overflow cannot occur during the normalization phase
of floating point division. 

13. [30] When doing “interval arithmetic” we don’t want to round the results of a
floating point computation; we want rather to implement operations such as $\bigtriangledown$ and $\bigtriangleup$,
which give the tightest possible representable bounds on the true sum:

$$u \bigtriangledown v \le u + v \le u \bigtriangleup v.$$

How should the algorithms of this section be modified for such a purpose? 

14. [25] Write a MIX subroutine that begins with an arbitrary floating point number 
in register A, not necessarily normalized, and converts it to the nearest fixed point 
integer (or determines that the number is too large in absolute value to make such a 
conversion possible). 

► 15. [28] Write a MIX subroutine, to be used in connection with the other subroutines
of this section, that calculates $u$ (mod) 1, that is, $u - \lfloor u \rfloor$ rounded to nearest floating
point number, given a floating point number $u$. Note that when $u$ is a very small
negative number, $u$ (mod) 1 will be rounded so that the result is unity (even though
$u \bmod 1$ has been defined to be always less than unity, as a real number).

16. [HM21] (Robert L. Smith.) Design an algorithm to compute the real and
imaginary parts of the complex number $(a + bi)/(c + di)$, given real floating point values $a$, $b$,
$c$, and $d$. Avoid the computation of $c^2 + d^2$, since it would cause floating point overflow
even when $|c|$ or $|d|$ is approximately the square root of the maximum allowable floating
point value.

17. [40] (John Cocke.) Explore the idea of extending the range of floating point 
numbers by defining a single-word representation in which the precision of the fraction 
decreases as the magnitude of the exponent increases. 

18. [25] Consider a binary computer with 36-bit words, on which positive floating
binary numbers are represented as $(0 e_1 e_2 \ldots e_8 f_1 f_2 \ldots f_{27})_2$; here $(e_1 e_2 \ldots e_8)_2$ is an
excess $(10000000)_2$ exponent and $(f_1 f_2 \ldots f_{27})_2$ is a 27-bit fraction. Negative floating
point numbers are represented by the two’s complement of the corresponding positive
representation (see Section 4.1). Thus, 1.5 is 201|600000000 in octal notation, while
$-1.5$ is 576|200000000; the octal representations of 1.0 and $-1.0$ are 201|400000000
and 576|400000000, respectively. (A vertical line is used here to show the boundary
between exponent and fraction.) Note that bit $f_1$ of a normalized positive number is
always 1, while it is almost always zero for negative numbers; the exceptional cases are
representations of $-2^k$.

Suppose that the exact result of a floating point operation has the octal code
572|740000000|01; this (negative) 33-bit fraction must be normalized and rounded to
27 bits. If we shift left until the leading fraction bit is zero, we get 576|000000000|20,
but this rounds to the illegal value 576|000000000; we have over-normalized, since
the correct answer is 575|400000000. On the other hand if we start (in some other
problem) with the value 572|740000000|05 and stop before over-normalizing it, we
get 575|400000000|50, which rounds to the unnormalized number 575|400000001;
subsequent normalization yields 576|000000002 while the correct answer is 576|000000001.

Give a simple, correct rounding rule that resolves this dilemma on such a machine 
(without abandoning two’s complement notation). 

Round numbers are always false.
—SAMUEL JOHNSON (1750) 

I shall speak in round numbers, not absolutely accurate,
yet not so wide from truth as to vary the result materially. 

—THOMAS JEFFERSON (1824) 


19. [24] What is the running time for the FADD subroutine in Program A, in terms 
of relevant characteristics of the data? What is the maximum running time, over all 
inputs that do not cause overflow or underflow? 


4.2.2. Accuracy of Floating Point Arithmetic 

Floating point computation is by nature inexact, and it is not difficult to misuse 
it so that the computed answers consist almost entirely of “noise.” One of the 
principal problems of numerical analysis is to determine how accurate the results 
of certain numerical methods will be. A “credibility-gap” problem is involved 
here: we don’t know how much of the computer’s answers to believe. Novice 
computer users solve this problem by implicitly trusting in the computer as an 
infallible authority; they tend to believe that all digits of a printed answer are 
significant. Disillusioned computer users have just the opposite approach; they
are constantly afraid that their answers are almost meaningless. Many a serious 
mathematician has attempted to give rigorous analyses of a sequence of floating 
point operations, but has found the task to be so formidable that he has tried 
to content himself with plausibility arguments instead. 

A thorough examination of error analysis techniques is, of course, beyond the 
scope of this book, but in this section we shall study some of the characteristics of 
floating point arithmetic errors. Our goal is to discover how to perform floating 
point arithmetic in such a way that reasonable analyses of error propagation are 
facilitated as much as possible. 

A rough (but reasonably useful) way to express the behavior of floating point 
arithmetic can be based on the concept of “significant figures” or relative error. 
If we are representing an exact real number $x$ inside a computer by using the
approximation $\hat{x} = x(1 + \epsilon)$, the quantity $\epsilon = (\hat{x} - x)/x$ is called the relative
error of approximation. Roughly speaking, the operations of floating point
multiplication and division do not magnify the relative error by very much; but
floating point subtraction of nearly equal quantities (and floating point addition,
$u \oplus v$, where $u$ is nearly equal to $-v$) can very greatly increase the relative
error. So we have a general rule of thumb, that a substantial loss of accuracy is 
expected from such additions and subtractions, but not from multiplications and 
divisions. On the other hand, the situation is somewhat paradoxical and needs 
to be understood properly, since “bad” additions and subtractions are performed 
with perfect accuracy! (See exercise 25.) 
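
The paradox can be made concrete. The following sketch assumes Python's `decimal` module with `prec=8` as a model of eight-digit floating decimal arithmetic and uses exact rationals from `fractions` to measure errors; the particular operands are chosen for illustration only.

```python
from decimal import Decimal, Context
from fractions import Fraction

ctx = Context(prec=8)

x_true = Fraction(1, 3)
x_hat = ctx.divide(Decimal(1), Decimal(3))   # .33333333, an approximation to 1/3
y = Decimal("0.33333300")                    # exactly representable, close to x

diff = ctx.subtract(x_hat, y)                # the floating point subtraction
exact_diff_of_inputs = Fraction(x_hat) - Fraction(y)

true_diff = x_true - Fraction(y)
rel_err_before = abs(Fraction(x_hat) - x_true) / x_true
rel_err_after = abs(Fraction(diff) - true_diff) / true_diff
print(rel_err_before, rel_err_after)
```

The subtraction itself commits no rounding error at all, yet the relative error inherited from the operand grows by a factor of a million, because the leading digits cancel and only the uncertain trailing digits remain.
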

One of the consequences of the possible unreliability of floating point
addition is that the associative law breaks down:

$$(u \oplus v) \oplus w \ne u \oplus (v \oplus w), \quad \text{for many } u, v, w. \tag{1}$$

For example,

$$\begin{aligned}
(11111113. \oplus -11111111.) \oplus 7.5111111 &= 2.0000000 \oplus 7.5111111 \\
&= 9.5111111; \\
11111113. \oplus (-11111111. \oplus 7.5111111) &= 11111113. \oplus -11111103. \\
&= 10.000000.
\end{aligned}$$

(All examples in this section are given in eight-digit floating decimal arithmetic,
with exponents indicated by an explicit decimal point. Recall that, as in Section
4.2.1, the symbols $\oplus$, $\ominus$, $\otimes$, $\oslash$ are used to stand for floating point operations
corresponding to the exact operations $+$, $-$, $\times$, $/$.)
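
The example accompanying (1) can be replayed directly. This sketch assumes that Python's `decimal` module with `prec=8` is an adequate model of the eight-digit floating decimal addition used in this section.

```python
from decimal import Decimal, Context

ctx = Context(prec=8)   # eight significant digits, round to nearest

u = Decimal("11111113")
v = Decimal("-11111111")
w = Decimal("7.5111111")

left = ctx.add(ctx.add(u, v), w)    # (u + v) + w, rounding after each step
right = ctx.add(u, ctx.add(v, w))   # u + (v + w), rounding after each step
print(left, right)
```

The inner sum of the second grouping must discard the fraction digits of $w$ to fit eight digits, which is where the two results part company.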

In view of the failure of the associative law, the comment of Mrs. La Touche 
that appears at the beginning of this chapter [taken from Math. Gazette 12
(1924), 95] makes a good deal of sense with respect to floating point arithmetic.
Mathematical notations like “$a_1 + a_2 + a_3$” or “$\sum_{1 \le k \le n} a_k$” are inherently
based upon the assumption of associativity, so a programmer must be especially 
careful that he does not implicitly assume the validity of the associative law. 

A. An axiomatic approach. Although the associative law is not valid, the 
commutative law 

$$u \oplus v = v \oplus u \tag{2}$$

does hold, and this law can be a valuable conceptual asset in programming and
in the analysis of programs. This example suggests that we should look for
important laws that are satisfied by $\oplus$, $\ominus$, $\otimes$, and $\oslash$; it is not unreasonable to
say that floating point routines should be designed to preserve as many of the
ordinary mathematical laws as possible. If more axioms are valid, it becomes
easier to write good programs, and programs also become more portable from
machine to machine.

Let us therefore consider some of the other basic laws that are valid for
normalized floating point operations as described in the previous section. First
we have

$$u \ominus v = u \oplus -v; \tag{3}$$
$$-(u \oplus v) = -u \oplus -v; \tag{4}$$
$$u \oplus v = 0 \quad\text{if and only if}\quad v = -u; \tag{5}$$
$$u \oplus 0 = u. \tag{6}$$

From these laws we can derive further identities; for example (exercise 1),

$$u \ominus v = -(v \ominus u). \tag{7}$$

Identities (2) to (6) are easily deduced from the algorithms in Section 4.2.1.
The following rule is slightly less obvious:

$$\text{if } u \le v \text{ then } u \oplus w \le v \oplus w. \tag{8}$$

Instead of attempting to prove this rule by analyzing Algorithm 4.2.1A, let us
go back to the principle underlying the design of that algorithm. (Algorithmic
proofs aren’t always easier than mathematical ones.) Our idea was that the
floating point operations should satisfy

$$\begin{aligned}
u \oplus v &= \mathrm{round}(u + v), & u \ominus v &= \mathrm{round}(u - v), \\
u \otimes v &= \mathrm{round}(u \times v), & u \oslash v &= \mathrm{round}(u / v),
\end{aligned} \tag{9}$$

where round($x$) denotes the best floating point approximation to $x$ as defined in
Algorithm 4.2.1N. We have

$$\mathrm{round}(-x) = -\mathrm{round}(x), \tag{10}$$
$$x \le y \quad\text{implies}\quad \mathrm{round}(x) \le \mathrm{round}(y), \tag{11}$$

and these fundamental relations prove properties (2) through (8) immediately.
We can also write down several more identities:

$$u \otimes v = v \otimes u, \qquad (-u) \otimes v = -(u \otimes v), \qquad 1 \otimes v = v;$$
$$u \otimes v = 0 \quad\text{if and only if}\quad u = 0 \text{ or } v = 0;$$
$$(-u) \oslash v = u \oslash (-v) = -(u \oslash v);$$
$$0 \oslash v = 0, \qquad u \oslash 1 = u, \qquad u \oslash u = 1.$$

If $u \le v$ and $w > 0$, then $u \otimes w \le v \otimes w$ and $u \oslash w \le v \oslash w$ and $w \oslash u \ge w \oslash v$.
If $u \oplus v = u + v$, then $(u \oplus v) \ominus v = u$; and if $u \otimes v = u \times v \ne 0$, then
$(u \otimes v) \oslash v = u$. We see that a good deal of regularity is present in spite of
the inexactness of the floating point operations, when things have been defined
properly.
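
A few of these laws can be spot-checked by brute force. The sketch below is an illustration rather than a proof; it assumes Python's `decimal` module with `prec=8` (round to nearest, ties to even) as the model of $\oplus$, $\ominus$, and $\otimes$, and the sample values and the name `check_laws` are arbitrary choices.

```python
from decimal import Decimal, Context
from itertools import product

ctx = Context(prec=8)
ZERO = Decimal(0)

# Sample operands, each exactly representable with eight significant digits.
samples = [Decimal(s) for s in
           ("0", "1", "-3.1415927", "2.7182818", "11111113", "-0.00012345678")]

def check_laws(values):
    """Assert laws (3)-(6), (8), and commutativity on all sample pairs and triples."""
    for u, v in product(values, repeat=2):
        assert ctx.subtract(u, v) == ctx.add(u, -v)       # (3)
        assert -ctx.add(u, v) == ctx.add(-u, -v)          # (4)
        assert (ctx.add(u, v) == ZERO) == (v == -u)       # (5)
        assert ctx.add(u, ZERO) == u                      # (6)
        assert ctx.add(u, v) == ctx.add(v, u)             # (2)
        assert ctx.multiply(u, v) == ctx.multiply(v, u)
    for u, v, w in product(values, repeat=3):
        if u <= v:
            assert ctx.add(u, w) <= ctx.add(v, w)         # (8)
    return True

print(check_laws(samples))
```

Because every operation here is defined as the correctly rounded exact result, as in (9), the checks succeed for any operands that stay clear of exponent overflow and underflow.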

Several familiar rules of algebra are still, of course, conspicuously absent
from the collection of identities above; the associative law for floating point
multiplication is not strictly true, as shown in exercise 3, and the distributive law
between $\otimes$ and $\oplus$ can fail rather badly: Let $u = 20000.000$, $v = -6.0000000$,
and $w = 6.0000003$; then

$$\begin{aligned}
(u \otimes v) \oplus (u \otimes w) &= -120000.00 \oplus 120000.01 = .010000000 \\
u \otimes (v \oplus w) &= 20000.000 \otimes .00000030000000 = .0060000000
\end{aligned}$$

so

$$u \otimes (v \oplus w) \ne (u \otimes v) \oplus (u \otimes w). \tag{12}$$

On the other hand we do have $b \otimes (v \oplus w) = (b \otimes v) \oplus (b \otimes w)$, when $b$ is the
floating point radix, since

$$\mathrm{round}(bx) = b\,\mathrm{round}(x). \tag{13}$$

(Strictly speaking, the identities and inequalities we are considering in this section
implicitly assume that exponent underflow and overflow do not occur. The
function round($x$) is undefined when $|x|$ is too small or too large, and equations
such as (13) hold only when both sides are defined.)
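
Both the failure (12) and the special case of the radix can be verified numerically. As before, this is a sketch that assumes Python's `decimal` module with `prec=8` as the model of eight-digit floating decimal arithmetic, so the radix $b$ is 10.

```python
from decimal import Decimal, Context

ctx = Context(prec=8)

u = Decimal("20000.000")
v = Decimal("-6.0000000")
w = Decimal("6.0000003")

lhs = ctx.multiply(u, ctx.add(v, w))                    # u (x) (v (+) w)
rhs = ctx.add(ctx.multiply(u, v), ctx.multiply(u, w))   # (u (x) v) (+) (u (x) w)
print(lhs, rhs)

b = Decimal(10)                                         # the floating point radix
radix_lhs = ctx.multiply(b, ctx.add(v, w))
radix_rhs = ctx.add(ctx.multiply(b, v), ctx.multiply(b, w))
print(radix_lhs, radix_rhs)
```

Multiplying by the radix only shifts the exponent, so no digits are lost and the distributive law survives in that one case.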

The failure of Cauchy’s fundamental inequality

$$(x_1^2 + \cdots + x_n^2)(y_1^2 + \cdots + y_n^2) \ge (x_1 y_1 + \cdots + x_n y_n)^2$$

is another important example of the breakdown of traditional algebra in the
presence of floating point arithmetic. Exercise 7 shows that Cauchy’s inequality
can fail even in the simple case $n = 2$, $x_1 = x_2 = 1$. Novice programmers who
calculate the standard deviation of some observations by using the textbook
formula

$$\sigma = \sqrt{\biggl(n \sum_{1 \le k \le n} x_k^2 - \Bigl(\sum_{1 \le k \le n} x_k\Bigr)^2\biggr) \bigg/ n(n-1)} \tag{14}$$

often find themselves taking the square root of a negative number! A much better
way to calculate means and standard deviations with floating point arithmetic
is to use the recurrence formulas

$$M_1 = x_1, \qquad M_k = M_{k-1} \oplus (x_k \ominus M_{k-1}) \oslash k, \tag{15}$$
$$S_1 = 0, \qquad S_k = S_{k-1} \oplus (x_k \ominus M_{k-1}) \otimes (x_k \ominus M_k), \tag{16}$$

for $2 \le k \le n$, where $\sigma = \sqrt{S_n/(n-1)}$. [Cf. B. P. Welford, Technometrics
4 (1962), 419-420.] With this method $S_n$ can never be negative, and we avoid
other serious problems encountered by the naive method of accumulating sums,
as shown in exercise 16. (See exercise 19 for a summation technique that provides
an even better guarantee on the accuracy.)
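
Recurrences (15) and (16) translate almost line for line into a program. The sketch below assumes Python's binary double-precision `float` in place of the floating decimal arithmetic of this section; the function names and the sample data are illustrative choices, and the input is assumed to contain at least two observations.

```python
import math

def welford(xs):
    """Return (mean, standard deviation) using recurrences (15) and (16)."""
    m = xs[0]          # M_1 = x_1
    s = 0.0            # S_1 = 0
    for k, x in enumerate(xs[1:], start=2):
        m_prev = m
        m = m_prev + (x - m_prev) / k        # (15)
        s = s + (x - m_prev) * (x - m)       # (16)
    return m, math.sqrt(s / (len(xs) - 1))

def textbook_variance(xs):
    """The quantity under the square root sign in formula (14)."""
    n = len(xs)
    return (n * sum(x * x for x in xs) - sum(xs) ** 2) / (n * (n - 1))

data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]   # large mean, small spread
mean, sigma = welford(data)
print(mean, sigma, textbook_variance(data))
```

For this data every step of the recurrences happens to be exact, so the computed variance is exactly 30. The textbook formula must subtract two quantities near $1.6 \times 10^{19}$ whose squares cannot be represented exactly, so its output need not be close to 30.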

Although algebraic laws do not always hold exactly, we can often show that
they aren’t too far off base. When $b^{e-1} \le |x| < b^e$ we have $\mathrm{round}(x) = x + \rho(x)$,
where $|\rho(x)| \le \tfrac12 b^{e-p}$; hence

$$\mathrm{round}(x) = x\bigl(1 + \delta(x)\bigr), \tag{17}$$

where the relative error is bounded independently of $x$:

$$|\delta(x)| \le \frac{\tfrac12 b^{1-p}}{1 + \tfrac12 b^{1-p}} < \tfrac12 b^{1-p}. \tag{18}$$

We can use this inequality to estimate the relative error of normalized floating
point calculations in a simple way, since $u \oplus v = (u + v)(1 + \delta(u + v))$, etc.
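
The bound (18) is easy to probe experimentally. The following sketch assumes $b = 10$, $p = 8$, and Python's `decimal` division of exact integers as a correctly rounded round($x$) for rational $x$; the grid of test fractions is an arbitrary choice.

```python
from decimal import Decimal, Context
from fractions import Fraction

P = 8
ctx = Context(prec=P)
BOUND = Fraction(1, 2) * Fraction(10) ** (1 - P)   # (1/2) b^(1-p) with b = 10

def delta(x):
    """Relative error committed when the rational x is rounded to P digits."""
    rounded = ctx.divide(Decimal(x.numerator), Decimal(x.denominator))
    return (Fraction(rounded) - x) / x

worst = max(abs(delta(Fraction(n, d)))
            for n in range(1, 40) for d in range(1, 40))
print(float(worst), float(BOUND))
```

The largest observed relative error stays below the bound, as (18) predicts for every $x$ within exponent range.
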

As an example of typical error-estimation procedures, let us consider the
associative law for multiplication. Exercise 3 shows that $(u \otimes v) \otimes w$ is not in
general equal to $u \otimes (v \otimes w)$; but the situation in this case is much better than
it was with respect to the associative law of addition (1) and the distributive law
(12). In fact, we have

$$\begin{aligned}
(u \otimes v) \otimes w &= ((uv)(1 + \delta_1)) \otimes w = uvw(1 + \delta_1)(1 + \delta_2), \\
u \otimes (v \otimes w) &= u \otimes ((vw)(1 + \delta_3)) = uvw(1 + \delta_3)(1 + \delta_4),
\end{aligned}$$

for some $\delta_1$, $\delta_2$, $\delta_3$, $\delta_4$, provided that no exponent underflow or overflow occurs,
where $|\delta_j| < \tfrac12 b^{1-p}$ for each $j$. Hence

$$\frac{(u \otimes v) \otimes w}{u \otimes (v \otimes w)} = \frac{(1 + \delta_1)(1 + \delta_2)}{(1 + \delta_3)(1 + \delta_4)} = 1 + \delta,$$

where

$$|\delta| < 2b^{1-p} \big/ \bigl(1 - \tfrac12 b^{1-p}\bigr)^2. \tag{19}$$

The number $b^{1-p}$ occurs so often in such analyses, it has been given a special
name, one ulp, meaning one “unit in the last place” of the fraction part. Floating
point operations are correct to within half an ulp, and the calculation of $uvw$ by
two floating point multiplications will be correct within about one ulp (ignoring
second-order terms). Hence the associative law for multiplication holds to within
about two ulps of relative error.
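
The estimate (19) can be compared with actual products. This sketch again assumes Python's `decimal` module with `prec=8` as the model of $\otimes$; the operand list is an arbitrary sample, and exact rationals are used to form the ratio of the two groupings.

```python
from decimal import Decimal, Context
from fractions import Fraction
from itertools import product

P = 8
ctx = Context(prec=P)
ULP = Fraction(1, 10 ** (P - 1))          # b^(1-p) with b = 10
BOUND = 2 * ULP / (1 - ULP / 2) ** 2      # right-hand side of (19)

values = [Decimal(s) for s in
          ("1.2345678", "9.8765432", "3.3333333", "7.7777777", "2.0000001")]

def assoc_error(u, v, w):
    """delta such that ((u*v)*w) / (u*(v*w)) = 1 + delta, in rounded arithmetic."""
    left = ctx.multiply(ctx.multiply(u, v), w)
    right = ctx.multiply(u, ctx.multiply(v, w))
    return Fraction(left) / Fraction(right) - 1

worst = max(abs(assoc_error(u, v, w)) for u, v, w in product(values, repeat=3))
print(float(worst / ULP), "ulps at worst; bound is", float(BOUND / ULP))
```

The worst discrepancy over the sample is reported in ulps, which makes the "about two ulps" rule of thumb directly visible.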

We have shown that $(u \otimes v) \otimes w$ is approximately equal to $u \otimes (v \otimes w)$,
except when exponent overflow or underflow is a problem. It is worthwhile to 
study this intuitive idea of being “approximately equal” in more detail; can we 
make such a statement more precise in a reasonable way? 

A programmer using floating point arithmetic almost never wants to test if
two computed values are exactly equal to each other (or at least he hardly ever
should try to do so), because this is an extremely improbable occurrence. For
example, if a recurrence relation

$$x_{n+1} = f(x_n)$$

is being used, where the theory in some textbook says that $x_n$ approaches a limit
as $n \to \infty$, it is usually a mistake to wait until $x_{n+1} = x_n$ for some $n$, since
the sequence $x_n$ might be periodic with a longer period due to the rounding of
intermediate results. The proper procedure is to wait until $|x_{n+1} - x_n| < \delta$, for
some suitably chosen number $\delta$; but since we don’t necessarily know the order
of magnitude of $x_n$ in advance, it is even better to wait until

$$|x_{n+1} - x_n| \le \epsilon |x_n|; \tag{20}$$

now $\epsilon$ is a number that is much easier to select. This relation (20) is another
way of saying that $x_{n+1}$ and $x_n$ are approximately equal; and our discussion
indicates that a relation of “approximately equal” would be more useful than the
traditional relation of equality, when floating point computations are involved,
if we could only define a suitable approximation relation.
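
A stopping rule based on (20) might look as follows. This is a sketch: the function name `iterate_until_stable`, the default tolerance, and the step limit are assumptions made for the example, and the recurrence used to exercise it (Newton's iteration for a square root) is merely a convenient test case.

```python
def iterate_until_stable(f, x0, eps=1e-12, max_steps=100):
    """Iterate x <- f(x) until |x_next - x| <= eps * |x|, as in relation (20)."""
    x = x0
    for _ in range(max_steps):
        x_next = f(x)
        if abs(x_next - x) <= eps * abs(x):
            return x_next
        x = x_next
    # Guard against sequences that cycle or drift instead of settling down.
    raise RuntimeError("no convergence within max_steps")

# Example: x_{n+1} = (x_n + 2/x_n)/2 approaches the square root of 2.
root = iterate_until_stable(lambda x: (x + 2.0 / x) / 2.0, 1.0)
print(root)
```

Because the test is relative, the same value of `eps` serves whether the limit is near $10^{-20}$ or $10^{20}$; an absolute tolerance would have to be chosen separately in each case.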

In other words, the fact that strict equality of floating point values is of
little importance implies that we ought to have a new operation, floating point
comparison, which is intended to help assess the relative values of two floating
point quantities. The following definitions seem to be appropriate for base $b$,
excess $q$, floating point numbers $u = (e_u, f_u)$ and $v = (e_v, f_v)$:


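$$u \prec v \quad (\epsilon) \qquad\text{if and only if}\qquad v - u > \epsilon \max(b^{e_u-q}, b^{e_v-q}); \tag{21}$$

$$u \sim v \quad (\epsilon) \qquad\text{if and only if}\qquad |v - u| \le \epsilon \max(b^{e_u-q}, b^{e_v-q}); \tag{22}$$

$$u \succ v \quad (\epsilon) \qquad\text{if and only if}\qquad u - v > \epsilon \max(b^{e_u-q}, b^{e_v-q}); \tag{23}$$

$$u \approx v \quad (\epsilon) \qquad\text{if and only if}\qquad |v - u| \le \epsilon \min(b^{e_u-q}, b^{e_v-q}). \tag{24}$$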


These definitions apply to unnormalized values as well as to normalized ones.
Note that exactly one of the conditions $u \prec v$ (definitely less than), $u \sim v$
(approximately equal to), or $u \succ v$ (definitely greater than) must always hold
for any given pair of values $u$ and $v$. The relation $u \approx v$ is somewhat
stronger than $u \sim v$, and it might be read “$u$ is essentially equal to $v$.”
All of the relations are given in terms of a positive real number $\epsilon$ that
measures the degree of approximation being considered.
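Definitions (21) through (24) translate almost directly into code. The following sketch assumes base $b = 10$ and excess $q = 0$, represents a floating point number as a pair `(e, f)` with the fraction `f` given as a decimal string, and uses exact rational arithmetic so that the comparisons themselves introduce no rounding. All function names are hypothetical.

```python
from fractions import Fraction

B = 10   # radix b (an assumption of this sketch)
Q = 0    # excess q (an assumption of this sketch)


def value(x):
    """Exact value b**(e - q) * f of the pair x = (e, f)."""
    e, f = x
    return Fraction(f) * Fraction(B) ** (e - Q)


def scale(x):
    """The quantity b**(e - q) that appears in definitions (21)-(24)."""
    return Fraction(B) ** (x[0] - Q)


def definitely_less(u, v, eps):        # u "definitely less than" v, definition (21)
    return value(v) - value(u) > eps * max(scale(u), scale(v))


def approx_equal(u, v, eps):           # u "approximately equal to" v, definition (22)
    return abs(value(v) - value(u)) <= eps * max(scale(u), scale(v))


def definitely_greater(u, v, eps):     # u "definitely greater than" v, definition (23)
    return value(u) - value(v) > eps * max(scale(u), scale(v))


def essentially_equal(u, v, eps):      # u "essentially equal to" v, definition (24)
    return abs(value(v) - value(u)) <= eps * min(scale(u), scale(v))


EPS = Fraction(1, 10000)
u = (0, "0.50000000")          # the value 0.5, normalized
v = (1, "0.05004000")          # the value 0.5004, unnormalized
```

With these operands the difference 0.0004 is within $\epsilon \cdot 10$ but not within $\epsilon \cdot 1$, so `approx_equal(u, v, EPS)` holds while `essentially_equal(u, v, EPS)` does not. The outcome depends on the exponents of the representations and not only on the values.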

One way to view the above definitions is to associate a “neighborhood” set
$N(u) = \{\, x \mid |x - u| \le \epsilon b^{e_u-q} \,\}$ with each floating point
number $u$; thus, $N(u)$ represents a set of values near $u$ based on the exponent
of its floating point representation. In these terms, we have $u \prec v$ if and
only if $N(u) < v$ and $u < N(v)$; $u \sim v$ if and only if $u \in N(v)$ or
$v \in N(u)$; $u \succ v$ if and only if $u > N(v)$ and $N(u) > v$; $u \approx v$ if
and only if $u \in N(v)$ and $v \in N(u)$. (Here we are assuming that the parameter
$\epsilon$, which measures the degree of approximation, is a constant; a more
complete notation would indicate the dependence of $N(u)$ upon $\epsilon$.)

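Here are some simple consequences of the above definitions:

$$\text{if } u \prec v \ (\epsilon) \text{ then } v \succ u \ (\epsilon); \tag{25}$$

$$\text{if } u \approx v \ (\epsilon) \text{ then } u \sim v \ (\epsilon); \tag{26}$$

$$u \approx u \ (\epsilon); \tag{27}$$

$$\text{if } u \prec v \ (\epsilon) \text{ then } u < v; \tag{28}$$

$$\text{if } u \prec v \ (\epsilon_1) \text{ and } \epsilon_1 \ge \epsilon_2 \text{ then } u \prec v \ (\epsilon_2); \tag{29}$$

$$\text{if } u \sim v \ (\epsilon_1) \text{ and } \epsilon_1 \le \epsilon_2 \text{ then } u \sim v \ (\epsilon_2); \tag{30}$$

$$\text{if } u \approx v \ (\epsilon_1) \text{ and } \epsilon_1 \le \epsilon_2 \text{ then } u \approx v \ (\epsilon_2); \tag{31}$$

$$\text{if } u \prec v \ (\epsilon_1) \text{ and } v \prec w \ (\epsilon_2) \text{ then } u \prec w \ (\epsilon_1 + \epsilon_2); \tag{32}$$

$$\text{if } u \approx v \ (\epsilon_1) \text{ and } v \approx w \ (\epsilon_2) \text{ then } u \sim w \ (\epsilon_1 + \epsilon_2). \tag{33}$$

Moreover, we can prove without difficulty that

$$|u - v| \le \epsilon|u| \text{ and } |u - v| \le \epsilon|v| \quad\text{implies}\quad u \approx v \ (\epsilon); \tag{34}$$

$$|u - v| \le \epsilon|u| \text{ or } |u - v| \le \epsilon|v| \quad\text{implies}\quad u \sim v \ (\epsilon); \tag{35}$$

and conversely, for normalized floating point numbers $u$ and $v$, when $\epsilon < 1$,

$$u \approx v \ (\epsilon) \quad\text{implies}\quad |u - v| \le b\epsilon|u| \text{ and } |u - v| \le b\epsilon|v|; \tag{36}$$

$$u \sim v \ (\epsilon) \quad\text{implies}\quad |u - v| \le b\epsilon|u| \text{ or } |u - v| \le b\epsilon|v|. \tag{37}$$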

Let $\epsilon_0 = b^{1-p}$ be one ulp. The derivation of (17) establishes the inequality
$|x - \mathrm{round}(x)| = |\rho(x)| < \tfrac12\epsilon_0 \min(|x|, |\mathrm{round}(x)|)$, hence

$$x \approx \mathrm{round}(x) \quad (\tfrac12\epsilon_0); \tag{38}$$

it follows that $u \oplus v \approx u + v$ $(\tfrac12\epsilon_0)$, etc. The approximate
associative law for multiplication derived above can be recast as follows: We have

$$|(u \otimes v) \otimes w - u \otimes (v \otimes w)| < \frac{2\epsilon_0}{(1 - \tfrac12\epsilon_0)^2}\,|u \otimes (v \otimes w)|$$

by (19), and the same inequality is valid with $(u \otimes v) \otimes w$ and
$u \otimes (v \otimes w)$ interchanged. Hence by (34),

$$(u \otimes v) \otimes w \approx u \otimes (v \otimes w) \quad (\epsilon) \tag{39}$$

whenever $\epsilon \ge 2\epsilon_0/(1 - \tfrac12\epsilon_0)^2$. For example, if $b = 10$
and $p = 8$ we may take $\epsilon = 0.00000021$.

The relations $\prec$, $\sim$, $\succ$, and $\approx$ are useful within numerical
algorithms, and it is therefore a good idea to provide routines for comparing
floating point numbers as well as for doing arithmetic on them.
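Relation (39) can be tried out with Python's `decimal` module, which provides correctly rounded decimal floating point at a chosen precision. The sketch below assumes $b = 10$, $p = 8$, $q = 0$, and round-to-even; the helper names, the random seed, and the range of the operands are arbitrary choices made for illustration.

```python
import random
from decimal import Context, Decimal, ROUND_HALF_EVEN

CTX = Context(prec=8, rounding=ROUND_HALF_EVEN)   # b = 10, p = 8
EPS = Decimal("0.00000021")                       # the epsilon quoted in the text


def scale(x):
    """b**e_x for nonzero x, where the fraction part of x lies in [0.1, 1)."""
    return Decimal(1).scaleb(x.adjusted() + 1)


def essentially_equal(u, v, eps):
    """Definition (24), for normalized decimal operands with q = 0."""
    return abs(u - v) <= eps * min(scale(u), scale(v))


def products(u, v, w):
    """Return ((u*v)*w, u*(v*w)), each product rounded to 8 digits."""
    left = CTX.multiply(CTX.multiply(u, v), w)
    right = CTX.multiply(u, CTX.multiply(v, w))
    return left, right


rng = random.Random(1)


def random_operand():
    # An 8-digit decimal number in the range [1, 10).
    return Decimal(rng.randint(10_000_000, 99_999_999)).scaleb(-7)


triples = [(random_operand(), random_operand(), random_operand()) for _ in range(1000)]
mismatches = sum(1 for t in triples if products(*t)[0] != products(*t)[1])
all_close = all(essentially_equal(*products(*t), EPS) for t in triples)
```

Here `mismatches` counts the triples for which the two groupings give different 8-digit results, while `all_close` confirms that every pair is nevertheless essentially equal at $\epsilon = 0.00000021$.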

Let us now shift our attention back to the question of finding exact relations 
that are satisfied by the floating point operations. It is interesting to note that 
floating point addition and subtraction are not completely intractable from an 
axiomatic standpoint, since they do satisfy the nontrivial identities stated in the 
following theorems. 

Theorem A. Let $u$ and $v$ be normalized floating point numbers. Then

$$((u \oplus v) \ominus u) + ((u \oplus v) \ominus ((u \oplus v) \ominus u)) = u \oplus v, \tag{40}$$

provided that no exponent overflow or underflow occurs.

This rather cumbersome-looking identity can be rewritten in a simpler manner:
Let

$$u' = (u \oplus v) \ominus v, \qquad v' = (u \oplus v) \ominus u;$$
$$u'' = (u \oplus v) \ominus v', \qquad v'' = (u \oplus v) \ominus u'. \tag{41}$$

Intuitively, $u'$ and $u''$ should be approximations to $u$, and $v'$ and $v''$ should be
approximations to $v$. Theorem A tells us that

$$u \oplus v = u' + v'' = u'' + v'. \tag{42}$$

This is a stronger statement than the identity

$$u \oplus v = u' \oplus v'' = u'' \oplus v', \tag{43}$$

which follows by rounding (42).
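The quantities in (41) are easy to compute with the `decimal` module, and identity (42) can then be checked exactly with rational arithmetic. This is a sketch assuming eight-digit decimal arithmetic with round-to-even; the function name and the sample operands are illustrative only.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN
from fractions import Fraction

CTX = Context(prec=8, rounding=ROUND_HALF_EVEN)   # b = 10, p = 8


def primed_quantities(u, v):
    """Return w = u (+) v together with u', v', u'', v'' as defined in (41)."""
    w = CTX.add(u, v)
    u1 = CTX.subtract(w, v)    # u'
    v1 = CTX.subtract(w, u)    # v'
    u2 = CTX.subtract(w, v1)   # u''
    v2 = CTX.subtract(w, u1)   # v''
    return w, u1, v1, u2, v2


u, v = Decimal("1.2345679"), Decimal("-0.23456785")
w, u1, v1, u2, v2 = primed_quantities(u, v)

# Exact (unrounded) sums, for comparison with the rounded sum w.
sum_a = Fraction(u1) + Fraction(v2)   # u' + v''
sum_b = Fraction(u2) + Fraction(v1)   # u'' + v'
```

For these operands $w = 1.0000000$, $u' = 1.2345678$, $v' = -0.2345679$, $u'' = 1.2345679$, and $v'' = -0.2345678$, so both exact sums equal $w$ even though $u' \ne u$.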

Proof. Let us say that $t$ is a tail of $x$ modulo $b^e$ if

$$t \equiv x \pmod{b^e}, \qquad |t| \le \tfrac12 b^e; \tag{44}$$

thus, $x - \mathrm{round}(x)$ is always a tail of $x$. The proof of Theorem A rests largely
on the following simple fact proved in exercise 11:

Lemma T. If $t$ is a tail of the floating point number $x$, then $x \ominus t = x - t$. |

Let $w = u \oplus v$. Theorem A holds trivially when $w = 0$. By multiplying all
variables by a suitable power of $b$, we may assume without loss of generality that
$e_w = p$. Then $u + v = w + r$, where $r$ is a tail of $u + v$ modulo 1. Furthermore
$u' = \mathrm{round}(w - v) = \mathrm{round}(u - r) = u - r - t$, where $t$ is a tail of
$u - r$ modulo $b^e$ and $e = e_{u'} - p$.

If $e \le 0$, then $t \equiv u - r \equiv -v$ (modulo $b^e$), hence $t$ is a tail of $-v$ and
$v'' = \mathrm{round}(w - u') = \mathrm{round}(v + t) = v + t$; this proves (40). If $e > 0$,
then $|u - r| \ge b^p - \tfrac12$; and since $|r| \le \tfrac12$, we have $|u| \ge b^p - 1$.
It follows that $r$ is a tail of $v$ modulo 1. If $u' = u$, we have
$v'' = \mathrm{round}(w - u) = \mathrm{round}(v - r) = v - r$. Otherwise the relation
$\mathrm{round}(u - r) \ne u$ implies that $|u| = b^p - 1$, $|r| = \tfrac12$, $|u'| = b^p$;
we have $v'' = \mathrm{round}(w - u') = \mathrm{round}(v + r) = v + r$, because $r$ is also a
tail of $-v$ in this case. |

Theorem A exhibits a regularity property of floating point addition, but it 
doesn’t seem to be an especially useful result. The following identity is more 
significant: 

Theorem B. Under the hypotheses of Theorem A and (41),

$$u + v = (u \oplus v) + ((u \ominus u') \oplus (v \ominus v'')). \tag{45}$$

Proof. In fact, we can show that $u \ominus u' = u - u'$, $v \ominus v'' = v - v''$, and
$(u - u') \oplus (v - v'') = (u - u') + (v - v'')$, hence (45) will follow from Theorem A.
Using the notation of the preceding proof, these relations are respectively
equivalent to

$$\mathrm{round}(t + r) = t + r, \qquad \mathrm{round}(t) = t, \qquad \mathrm{round}(r) = r. \tag{46}$$

Exercise 12 establishes the theorem in the special case $|e_u - e_v| \ge p$. Otherwise
$u + v$ has at most $2p$ significant digits and it is easy to see that $\mathrm{round}(r) = r$.
If now $e > 0$, the proof of Theorem A shows that $t = -r$ or $t = r = \pm\tfrac12$.
If $e \le 0$ we have $t + r \equiv u$ and $t \equiv -v$ (modulo $b^e$); this is enough to
prove that $t + r$ and $r$ round to themselves, provided that $e_u \ge e$ and $e_v \ge e$.
But either $e_u < 0$ or $e_v < 0$ would contradict our hypothesis that $|e_u - e_v| < p$,
since $e_w = p$. |
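Identity (45) is the basis of what is now often called an error-free transformation of a sum. A minimal sketch with Python's built-in binary floats follows; it assumes IEEE 754 double precision with round-to-nearest and no overflow, and the name `two_sum` is a common label for this computation, not one used in the text.

```python
from fractions import Fraction


def two_sum(u, v):
    """Return (w, c) with w = u (+) v and c = (u (-) u') (+) (v (-) v''), as in (45)."""
    w = u + v
    u1 = w - v                         # u'
    v2 = w - u1                        # v''
    correction = (u - u1) + (v - v2)   # the rounding error committed by u (+) v
    return w, correction


pairs = [(0.1, 0.2), (1e16, 1.0), (1.0, 1e-20), (-3.5, 3.5000000000000004)]
results = [(u, v) + two_sum(u, v) for (u, v) in pairs]
```

Converting each entry of `results` to exact fractions shows that `w + correction` equals the true sum `u + v`, with nothing lost to rounding.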

Theorem B gives an explicit formula for the difference between $u + v$ and
$u \oplus v$, in terms of quantities that can be calculated directly using five operations
of floating point arithmetic. If the radix $b$ is 2 or 3, we can improve on this
result, obtaining the exact value of the correction term with only two floating
point operations and one (fixed point) comparison of absolute values:

Theorem C. If $b \le 3$ and $|u| \ge |v|$, then

$$u + v = (u \oplus v) + (u \ominus (u \oplus v)) \oplus v. \tag{47}$$

Proof. Following the conventions of preceding proofs again, we wish to show
that $v \ominus v' = r$. It suffices to show that $v' = w - u$, because (46) will then yield
$v \ominus v' = \mathrm{round}(v - v') = \mathrm{round}(u + v - w) = \mathrm{round}(r) = r$.

We shall in fact prove (47) whenever $b \le 3$ and $e_u \ge e_v$. If $e_u \ge p$, then $r$
is a tail of $v$ modulo 1, hence $v' = w \ominus u = v \ominus r = v - r = w - u$ as desired.
If $e_u < p$, then we must have $e_u = p - 1$, and $w - u$ is a multiple of $b^{-1}$; it will
therefore round to itself if its magnitude is less than $b^{p-1} + b^{-1}$. Since $b \le 3$,
we have indeed $|w - u| \le |w - u - v| + |v| \le \tfrac12 + (b^{p-1} - b^{-1}) < b^{p-1} + b^{-1}$.
This completes the proof. |
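For radix 2, identity (47) gives a shorter way to recover the rounding error of a sum. The sketch below again assumes IEEE 754 double precision with round-to-nearest; the function name `fast_two_sum` is a conventional label and not taken from the text.

```python
from fractions import Fraction


def fast_two_sum(u, v):
    """Return (w, c) with u + v = w + c exactly, following (47) for radix 2."""
    if abs(u) < abs(v):          # the one comparison of absolute values
        u, v = v, u
    w = u + v
    correction = (u - w) + v     # (u (-) (u (+) v)) (+) v
    return w, correction


pairs = [(0.1, 0.2), (1e16, 1.0), (1e-20, 1.0), (-3.5, 3.5000000000000004)]
results = [(u, v) + fast_two_sum(u, v) for (u, v) in pairs]
```

Compared with the five-operation correction of Theorem B, this needs only two further floating point operations once the operands have been ordered by magnitude.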

The proofs of Theorems A, B, and C do not rely on the precise definitions of 
round(x) in the ambiguous cases when x is exactly midway between consecutive 
floating point numbers; any way of resolving the ambiguity will suffice for the 
validity of everything we have proved so far. 

No rounding rule can be best for every application. For example, we generally
want a special rule when computing our income tax. But for most numerical
calculations the best policy appears to be the rounding scheme specified in
Algorithm 4.2.1N, which insists that the least significant digit should always
be made even (or always odd) when an ambiguous value is rounded. This is not
a trivial technicality, of interest only to nit-pickers; it is an important practical
consideration, since the ambiguous case arises surprisingly often and a biased
rounding rule produces significantly poor results. For example, consider decimal
arithmetic and assume that remainders of 5 are always rounded upwards. Then
if $u = 1.0000000$ and $v = 0.55555555$ we have $u \oplus v = 1.5555556$; and if
we floating-subtract $v$ from this result we get $u' = 1.0000001$. Adding and
subtracting $v$ from $u'$ gives 1.0000002, and the next time we get 1.0000003, etc.;
the result keeps growing although we are adding and subtracting the same value.
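The drift just described can be reproduced with the `decimal` module by switching between its `ROUND_HALF_UP` and `ROUND_HALF_EVEN` modes. The sketch assumes eight-digit decimal arithmetic; the function name is illustrative.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def add_then_subtract(u, v, rounding, rounds):
    """Replace u by (u (+) v) (-) v the given number of times, at 8 digits."""
    ctx = Context(prec=8, rounding=rounding)
    for _ in range(rounds):
        u = ctx.subtract(ctx.add(u, v), v)
    return u


u, v = Decimal("1.0000000"), Decimal("0.55555555")
drifting = add_then_subtract(u, v, ROUND_HALF_UP, 3)    # remainders of 5 rounded upwards
stable = add_then_subtract(u, v, ROUND_HALF_EVEN, 3)    # round to even
```

After three rounds the biased rule has moved the value to 1.0000003, while round-to-even returns 1.0000000 every time.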

This phenomenon, called drift, will not occur when we use a stable rounding 
rule based on the parity of the least significant digit. More precisely: 

Theorem D. $(((u \oplus v) \ominus v) \oplus v) \ominus v = (u \oplus v) \ominus v$.

For example, if $u = 1.2345679$ and $v = -0.23456785$, we find $u \oplus v =
1.0000000$, $(u \oplus v) \ominus v = 1.2345678$, $((u \oplus v) \ominus v) \oplus v = 0.99999995$,
and $(((u \oplus v) \ominus v) \oplus v) \ominus v = 1.2345678$. The proof for general $u$ and $v$
seems to require a case analysis even more detailed than that in the above
theorems; see the references at the end of this section. |
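The worked example for Theorem D can be replayed step by step. This is a small sketch assuming eight-digit decimal arithmetic with round-to-even, as provided by Python's `decimal` module; the variable names are arbitrary.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

CTX = Context(prec=8, rounding=ROUND_HALF_EVEN)

u, v = Decimal("1.2345679"), Decimal("-0.23456785")
s1 = CTX.add(u, v)          # u (+) v
s2 = CTX.subtract(s1, v)    # (u (+) v) (-) v
s3 = CTX.add(s2, v)         # ((u (+) v) (-) v) (+) v
s4 = CTX.subtract(s3, v)    # (((u (+) v) (-) v) (+) v) (-) v
```

The values `s2` and `s4` coincide, as the theorem asserts, although `s1` and `s3` differ.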

Theorem D is valid both for “round to even” and “round to odd”; how should
we choose between these possibilities? When the radix $b$ is odd, ambiguous cases
never arise except during floating point division, and the rounding in such cases
is comparatively unimportant. For even radices, there is reason to prefer the
following rule: “Round to even when $b/2$ is odd, round to odd when $b/2$ is
even.” The least significant digit of a floating point fraction occurs frequently
as a remainder to be rounded off in subsequent calculations, and this rule avoids
generating the digit $b/2$ in the least significant position whenever possible; its
effect is to provide some memory of an ambiguous rounding so that subsequent
rounding will tend to be unambiguous. For example, if we were to round to odd
in the decimal system, repeated rounding of the number 2.44445 to one less place
each time leads to the sequence 2.4445, 2.445, 2.45, 2.5, 3; but if we round to
even, such situations do not occur. [Roy A. Keir, Inf. Proc. Letters 3 (1975),
188-189.] On the other hand, some people prefer rounding to even in all cases, so
that the remainder will tend to be 0 more often. Neither alternative conclusively
dominates the other; fortunately the base is usually $b = 2$ or $b = 10$, when
everyone agrees that round-to-even is best.
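The repeated-rounding example can be checked mechanically. The `decimal` module has no round-to-odd mode, so the sketch below supplies a hypothetical helper `round_half_odd` that rounds to the nearest value and sends exact ties to the neighbor whose last digit is odd; it is written for the positive operands used here.

```python
from decimal import Decimal, ROUND_FLOOR, ROUND_HALF_EVEN


def round_half_even(x, places):
    """Round x to the given number of decimal places, ties to the even digit."""
    return x.quantize(Decimal(1).scaleb(-places), rounding=ROUND_HALF_EVEN)


def round_half_odd(x, places):
    """Round x to the given number of decimal places, ties to the odd digit."""
    unit = Decimal(1).scaleb(-places)
    low = x.quantize(unit, rounding=ROUND_FLOOR)
    if (x - low) * 2 == unit:                            # exactly halfway
        return low if int(low / unit) % 2 == 1 else low + unit
    return x.quantize(unit, rounding=ROUND_HALF_EVEN)    # no tie: plain nearest


def repeated(rounder, x, start_places):
    """Round x to start_places, start_places - 1, ..., 0 decimal places in turn."""
    steps = []
    for places in range(start_places, -1, -1):
        x = rounder(x, places)
        steps.append(x)
    return steps


odd_steps = repeated(round_half_odd, Decimal("2.44445"), 4)
even_steps = repeated(round_half_even, Decimal("2.44445"), 4)
```

With round-to-odd every step meets a fresh tie and the value climbs to 3; with round-to-even the first step yields 2.4444 and no later step is ambiguous.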

A reader who has checked some of the details of the above proofs will realize
the immense simplification that has been afforded by the simple rule $u \oplus v =
\mathrm{round}(u + v)$. If our floating point addition routine would fail to give this
result even in a few rare cases, the proofs would become enormously more
complicated and perhaps they would even break down completely.

Theorem B fails if truncation arithmetic is used in place of rounding, i.e.,
if we let $u \oplus v = \mathrm{trunc}(u + v)$ and $u \ominus v = \mathrm{trunc}(u - v)$, where
$\mathrm{trunc}(x)$ takes all positive real $x$ into the largest floating point number
$\le x$. An exception to Theorem B would then occur for cases such as
$(20, +.10000001) \oplus (10, -.10000001) = (20, +.10000000)$, when the difference
between $u + v$ and $u \oplus v$ cannot be expressed exactly as a floating point number;
and also for cases such as $12345678 \oplus .012345678$, when it can be.
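The second counterexample can be examined with the `decimal` module. The sketch assumes eight-digit decimal arithmetic and models truncation with `ROUND_DOWN` (rounding toward zero), which agrees with trunc for positive values and is an assumption for the negative intermediate result that arises; the function name is illustrative.

```python
from decimal import Context, Decimal, ROUND_DOWN, ROUND_HALF_EVEN
from fractions import Fraction


def theorem_b_terms(u, v, ctx):
    """Return (w, c): the floating sum and the correction term of (45) under ctx."""
    w = ctx.add(u, v)
    u1 = ctx.subtract(w, v)    # u'
    v2 = ctx.subtract(w, u1)   # v''
    correction = ctx.add(ctx.subtract(u, u1), ctx.subtract(v, v2))
    return w, correction


u, v = Decimal("12345678"), Decimal("0.012345678")

w_trunc, c_trunc = theorem_b_terms(u, v, Context(prec=8, rounding=ROUND_DOWN))
w_round, c_round = theorem_b_terms(u, v, Context(prec=8, rounding=ROUND_HALF_EVEN))

# The true difference u + v - w is 0.012345678 in both cases.
true_error = Fraction(u) + Fraction(v) - Fraction(w_trunc)
```

Under truncation the computed correction is 0.01234568 rather than 0.012345678, although the true difference is itself an eight-digit number; with proper rounding the correction is exact.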

Many people feel that, since floating point arithmetic is inexact by nature,
there is no harm in making it just a little bit less exact in certain rather rare cases,
if it is convenient to do so. This policy saves a few cents in the design of computer
hardware, or a small percentage of the average running time of a subroutine. But
the above discussion shows that such a policy is mistaken. We could save about
five percent of the running time of the FADD subroutine, Program 4.2.1A, and
about 25 percent of its space, if we took the liberty of rounding incorrectly in a
few cases, but we are much better off leaving it as it is. The reason is not to glorify
“bit chasing”; a more fundamental issue is at stake here: Numerical subroutines
should deliver results that satisfy simple, useful mathematical laws whenever
possible. The crucial formula $u \oplus v = \mathrm{round}(u + v)$ is a “regularity” property
that makes a great deal of difference between whether mathematical analysis
of computational algorithms is worth doing or worth avoiding. Without any
underlying symmetry properties, the job of proving interesting results becomes
extremely unpleasant. The enjoyment of one’s tools is an essential ingredient of
successful work.

B. Unnormalized floating point arithmetic. The policy of normalizing all floating
point numbers may be construed in two ways: We may look on it favorably by
saying that it is an attempt to get the maximum possible accuracy obtainable
with a given degree of precision, or we may consider it to be potentially dangerous
since it tends to imply that the results are more accurate than they really
are. When we normalize the result of $(1, +.31428571) \ominus (1, +.31415927)$ to
$(-2, +.12644000)$, we are suppressing information about the possibly greater
inaccuracy of the latter quantity. Such information would be retained if the
answer were left as $(1, +.00012644)$.

The input data to a problem is frequently not known as precisely as the
floating point representation allows. For example, the values of Avogadro’s
number and Planck’s constant are not known to eight significant digits, and
it might be more appropriate to denote them, respectively, by

$(27, +.00060225)$ and $(-23, +.00010545)$

instead of by $(24, +.60225200)$ and $(-26, +.10545000)$. It would be nice if
we could give our input data for each problem in an unnormalized form that
expresses how much precision is assumed, and if the output would indicate just
how much precision is known in the answer. Unfortunately, this is a terribly
difficult problem, although the use of unnormalized arithmetic can help to give
some indication. For example, we can say with a fair degree of certainty that
the product of Avogadro’s number by Planck’s constant is $(0, +.00063507)$, and
that their sum is $(27, +.00060225)$. (The purpose of this example is not to
suggest that any important physical significance should be attached to the sum
and product of these fundamental constants; the point is that it is possible to
preserve a little of the information about precision in the result of calculations
with imprecise quantities, when the original operands are independent of each
other.)

The rules for unnormalized arithmetic are simply this: Let $l_u$ be the number
of leading zeros in the fraction part of $u = (e_u, f_u)$, so that $l_u$ is the largest
integer $\le p$ with $|f_u| < b^{-l_u}$. Then addition and subtraction are performed
just as in Algorithm 4.2.1A, except that all scaling to the left is suppressed.
Multiplication and division are performed as in Algorithm 4.2.1M, except that
the answer is scaled right or left so that precisely $\max(l_u, l_v)$ leading zeros appear.
Essentially the same rules have been used in manual calculation for many years.
It follows that, for unnormalized computations,

$$e_{u \oplus v},\ e_{u \ominus v} = \max(e_u, e_v) + (0 \text{ or } 1) \tag{48}$$

$$e_{u \otimes v} = e_u + e_v - q - \min(l_u, l_v) - (0 \text{ or } 1) \tag{49}$$

$$e_{u \oslash v} = e_u - e_v + q - l_u + l_v + \max(l_u, l_v) + (0 \text{ or } 1). \tag{50}$$

When the result of a calculation is zero, an unnormalized zero (often called an
“order of magnitude zero”) is given as the answer; this indicates that the answer
may not truly be zero, we just don’t know any of its significant digits.
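The rules above can be sketched on pairs `(e, F)`, where the fraction is $F/10^8$ for an integer $F$. This is a simplified model assuming $b = 10$, $p = 8$, $q = 0$, round-to-even, and nonzero operands for multiplication; it is not a transcription of Algorithms 4.2.1A or 4.2.1M, and the function names are hypothetical.

```python
from fractions import Fraction

P = 8    # digits in the fraction part (assumed)
B = 10   # radix (assumed); the excess q is taken to be 0


def leading_zeros(F):
    """Number of leading zeros l_u when the fraction is F / B**P."""
    return P - len(str(abs(F))) if F else P


def unnormalized_add(u, v):
    """Align to the larger exponent and round; never scale to the left."""
    (eu, Fu), (ev, Fv) = u, v
    e = max(eu, ev)
    exact = Fraction(Fu) * Fraction(B) ** (eu - e) + Fraction(Fv) * Fraction(B) ** (ev - e)
    F = round(exact)                       # round() on a Fraction rounds ties to even
    if abs(F) >= B ** P:                   # fraction overflow: scale right once
        F, e = round(exact / B), e + 1
    return e, F


def unnormalized_mul(u, v):
    """Multiply, keeping exactly max(l_u, l_v) leading zeros in the result."""
    (eu, Fu), (ev, Fv) = u, v
    if Fu == 0 or Fv == 0:
        raise ValueError("this sketch handles nonzero fractions only")
    keep = P - max(leading_zeros(Fu), leading_zeros(Fv))   # digits retained
    N = Fu * Fv
    shift = len(str(abs(N))) - keep
    F = round(Fraction(N) / Fraction(B) ** shift)
    e = eu + ev - P + shift
    if len(str(abs(F))) > keep:            # rounding carried into an extra digit
        F, e = round(Fraction(F, B)), e + 1
    return e, F


avogadro = (27, 60225)     # (27, +.00060225)
planck = (-23, 10545)      # (-23, +.00010545)
product = unnormalized_mul(avogadro, planck)
total = unnormalized_add(avogadro, planck)
difference = unnormalized_add((1, 31428571), (1, -31415927))
```

The results are `(0, 63507)`, `(27, 60225)`, and `(1, 12644)`, that is, $(0, +.00063507)$, $(27, +.00060225)$, and $(1, +.00012644)$, in agreement with the values quoted in this section.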

Error analysis takes a somewhat different form with unnormalized floating
point arithmetic. Let us define

$$\delta_u = \tfrac12 b^{e_u - q - p} \qquad\text{if } u = (e_u, f_u). \tag{51}$$

This quantity depends on the representation of $u$, not just on the value $b^{e_u-q} f_u$.
Our rounding rule tells us that

$$|u \oplus v - (u + v)| \le \delta_{u \oplus v}, \qquad |u \ominus v - (u - v)| \le \delta_{u \ominus v},$$

$$|u \otimes v - (u \times v)| \le \delta_{u \otimes v}, \qquad |u \oslash v - (u / v)| \le \delta_{u \oslash v}.$$

These inequalities apply to normalized as well as unnormalized arithmetic; the
main difference between the two types of error analysis is the definition of the
exponent of the result of each operation (Eqs. (48) to (50)).

We have remarked that the relations $\prec$, $\sim$, $\succ$, and $\approx$ defined earlier in
this section are valid and meaningful for unnormalized numbers as well as for
normalized numbers. As an example of the use of these relations, let us prove
an approximate associative law for unnormalized addition, analogous to (39):

$$(u \oplus v) \oplus w \approx u \oplus (v \oplus w) \quad (\epsilon), \tag{52}$$

for suitable $\epsilon$. We have

$$|(u \oplus v) \oplus w - (u + v + w)| \le |(u \oplus v) \oplus w - ((u \oplus v) + w)| + |u \oplus v - (u + v)|$$
$$\le \delta_{(u \oplus v) \oplus w} + \delta_{u \oplus v} \le 2\delta_{(u \oplus v) \oplus w}.$$

A similar formula holds for $|u \oplus (v \oplus w) - (u + v + w)|$. Now since
$e_{(u \oplus v) \oplus w} = \max(e_u, e_v, e_w) + (0, 1, \text{or } 2)$, we have
$\delta_{(u \oplus v) \oplus w} \le b^2 \delta_{u \oplus (v \oplus w)}$. Therefore we
find that (52) is valid when $\epsilon \ge 2b^{2-p}$; unnormalized addition is not as erratic
as normalized addition with respect to the associative law.

It should be emphasized that unnormalized arithmetic is by no means a
panacea. There are examples where it indicates greater accuracy than is present
(e.g., addition of a great many small quantities of about the same magnitude,
or evaluation of $x^n$ for large $n$); and there are many more examples when it
indicates poor accuracy while normalized arithmetic actually does produce good
results. There is an important reason why no straightforward one-operation-at-a-time
method of error analysis can be completely satisfactory, namely the fact
that operands are usually not independent of each other. This means that errors
tend to cancel or reinforce each other in strange ways. For example, suppose
that $x$ is approximately $\tfrac12$, and suppose that we have an approximation
$y = x + \delta$ with absolute error $\delta$. If we now wish to compute $x(1 - x)$, we can
form $y(1 - y)$; if $x = \tfrac12 + \epsilon$ we find $y(1 - y) = x(1 - x) - 2\epsilon\delta - \delta^2$,
so the error has decreased substantially: it has been multiplied by a factor of
$2\epsilon + \delta$. This is just one case where multiplication of imprecise quantities can
lead to a quite accurate result when the operands are not independent of each
other. A more obvious example is the computation of $x \ominus x$, which can be obtained
with perfect accuracy regardless of how bad an approximation to $x$ we begin with.
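The cancellation in $y(1 - y)$ can be confirmed with exact rational arithmetic. The particular values $\epsilon = 0.001$ and $\delta = 0.01$ below are arbitrary choices made for illustration.

```python
from fractions import Fraction

eps = Fraction(1, 1000)            # x differs from 1/2 by eps
delta = Fraction(1, 100)           # absolute error of the approximation y
x = Fraction(1, 2) + eps
y = x + delta

error_in = y - x                               # the error we start with
error_out = y * (1 - y) - x * (1 - x)          # the error in the computed product
predicted = -2 * eps * delta - delta ** 2      # the formula derived in the text
```

Here the input error is 0.01 while the output error is only $-0.00012$, which equals the input error multiplied by $-(2\epsilon + \delta) = -0.012$.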

The extra information that unnormalized arithmetic gives us can often be
more important than the information it destroys during an extended calculation,
but (as usual) we must use it with care. Examples of the proper use of
unnormalized arithmetic are discussed by R. L. Ashenhurst and N. Metropolis
in Computers and Computing, AMM Slaught Memorial Papers 10 (February,
1965), 47-59; by N. Metropolis in Numer. Math. 7 (1965), 104-112; and by R. L.
Ashenhurst in Error in Digital Computation 2, ed. by L. B. Rall (New York:
Wiley, 1965), 3-37. Appropriate methods for computing standard mathematical
functions with both input and output in unnormalized form are given by R. L.
Ashenhurst in JACM 11 (1964), 168-187. An extension of unnormalized arithmetic,
which remembers that certain values are known to be exact, has been
discussed by N. Metropolis in IEEE Trans. C-22 (1973), 573-576.

C. Interval arithmetic. Another approach to the problem of error determination
is the so-called interval or range arithmetic, in which upper and lower bounds
on each number are maintained during the calculations. Thus, for example, if
we know that $u_0 \le u \le u_1$ and $v_0 \le v \le v_1$, we represent this by the interval
notation $u = [u_0, u_1]$, $v = [v_0, v_1]$. The sum $u \oplus v$ is
$[u_0 \mathbin{\nabla_{+}} v_0,\ u_1 \mathbin{\Delta_{+}} v_1]$, where $\nabla_{+}$
denotes “lower floating point addition,” the greatest representable number less
than or equal to the true sum, and $\Delta_{+}$ is defined similarly (see exercise 4.2.1-13).
Furthermore $u \ominus v = [u_0 \mathbin{\nabla_{-}} v_1,\ u_1 \mathbin{\Delta_{-}} v_0]$; and if
$u_0$ and $v_0$ are positive, we have
$u \otimes v = [u_0 \mathbin{\nabla_{\times}} v_0,\ u_1 \mathbin{\Delta_{\times}} v_1]$,
$u \oslash v = [u_0 \mathbin{\nabla_{/}} v_1,\ u_1 \mathbin{\Delta_{/}} v_0]$. For example, we might
represent Avogadro’s number and Planck’s constant as

$$N = [(24, +.60222400),\ (24, +.60228000)],$$
$$h = [(-26, +.10544300),\ (-26, +.10545700)];$$

their sum and product would then turn out to be

$$N \oplus h = [(24, +.60222400),\ (24, +.60228001)],$$
$$N \otimes h = [(-3, +.63500305),\ (-3, +.63514642)].$$

If we try to divide by $[v_0, v_1]$ when $v_0 < 0 < v_1$, there is a possibility
of division by zero. Since the philosophy underlying interval arithmetic is to
provide rigorous error estimates, a divide-by-zero error should be signalled in this
case. However, overflow and underflow need not be treated as errors in interval
arithmetic, if special conventions are introduced as discussed in exercise 24.
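Directed rounding is available in Python's `decimal` module, so the interval operations described above can be sketched directly. The code assumes eight-digit decimal arithmetic, represents an interval as a pair `(lower, upper)`, applies the multiplication and division formulas only to positive intervals, and uses hypothetical function names throughout.

```python
from decimal import Context, Decimal, ROUND_CEILING, ROUND_FLOOR

DOWN = Context(prec=8, rounding=ROUND_FLOOR)     # "lower" operations
UP = Context(prec=8, rounding=ROUND_CEILING)     # "upper" operations


def interval_add(u, v):
    return DOWN.add(u[0], v[0]), UP.add(u[1], v[1])


def interval_sub(u, v):
    return DOWN.subtract(u[0], v[1]), UP.subtract(u[1], v[0])


def interval_mul_positive(u, v):
    """Product formula for intervals whose lower bounds are positive."""
    return DOWN.multiply(u[0], v[0]), UP.multiply(u[1], v[1])


def interval_div(u, v):
    """Quotient formula for positive intervals; a divisor containing zero is rejected."""
    if v[0] <= 0 <= v[1]:
        raise ZeroDivisionError("divisor interval contains zero")
    return DOWN.divide(u[0], v[1]), UP.divide(u[1], v[0])


N = (Decimal("0.60222400E24"), Decimal("0.60228000E24"))     # Avogadro's number
h = (Decimal("0.10544300E-26"), Decimal("0.10545700E-26"))   # Planck's constant

N_plus_h = interval_add(N, h)
N_times_h = interval_mul_positive(N, h)
```

The results reproduce the intervals quoted for the sum and product of $N$ and $h$: the upper bound of the sum is pushed up to $.60228001 \times 10^{24}$ because the true sum exceeds $.60228000 \times 10^{24}$ by a tiny amount.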

Interval arithmetic takes only about twice as long as ordinary arithmetic, 
and it provides truly reliable error estimates. Considering the difficulty of 
mathematical error analyses, this is indeed a small price to pay. Since the 
intermediate values in a calculation often depend on each other, as explained 
above, the final estimates obtained with interval arithmetic will tend to be 
pessimistic; and iterative numerical methods often have to be redesigned if we 
want to deal with intervals. The prospects for effective use of interval arithmetic 
look very good, however, and efforts should be made to increase its availability. 

D. History and bibliography. Jules Tannery’s classic treatise on decimal
calculations, Leçons d’Arithmétique (Paris: Colin, 1894), stated that positive numbers
should be rounded upwards if the first discarded digit is 5 or more; since exactly
half of the decimal digits are 5 or more, he felt that this rule would round
upwards exactly half of the time, on the average, so it would produce compensating
errors. The idea of “round to even” in the ambiguous cases seems to have been
mentioned first by James B. Scarborough in the first edition of his pioneering
book Numerical Mathematical Analysis (Baltimore: Johns Hopkins Press, 1930),
p. 2; in the second (1950) edition he amplified his earlier remarks, stating that “It
should be obvious to any thinking person that when a 5 is cut off, the preceding
digit should be increased by 1 in only half the cases,” and he recommended
round-to-even in order to achieve this.

The first analysis of floating point arithmetic was given by F. L. Bauer
and K. Samelson, Zeitschrift für angewandte Math. und Physik 4 (1953), 312-316.
The next publication was not until over five years later: J. W. Carr III,
CACM 2, 5 (May 1959), 10-15. See also P. C. Fischer, Proc. ACM Nat. Meeting
13 (Urbana, Illinois, 1958), paper 39. The book Rounding Errors in Algebraic
Processes (Englewood Cliffs: Prentice-Hall, 1963), by J. H. Wilkinson, shows
how to apply error analysis of the individual arithmetic operations to the error
analysis of large-scale problems; see also his treatise on The Algebraic Eigenvalue
Problem (Oxford: Clarendon Press, 1965).

More recent work on floating point accuracy is summarized in two important
papers that can be especially recommended for further study: W. M. Kahan,
Proc. IFIP Congress (1971), 2, 1214-1239; R. P. Brent, IEEE Trans. C-22 (1973),
601-607. Both papers include useful theory and demonstrate that it pays off in
practice.

The relations $\prec$, $\sim$, $\succ$, $\approx$ introduced in this section are similar to ideas
published by A. van Wijngaarden in BIT 6 (1966), 66-81. Theorems A and B
above were inspired by some related work of Ole Møller, BIT 5 (1965), 37-50,
251-255; Theorem C is due to T. J. Dekker, Numer. Math. 18 (1971), 224-242.
Extensions and refinements of all three theorems have been published by S.
Linnainmaa, BIT 14 (1974), 167-202. W. M. Kahan introduced Theorem D in
some unpublished notes; for a complete proof and further commentary, see J. F.
Reiser and D. E. Knuth, Inf. Proc. Letters 3 (1975), 84-87, 164.

Unnormalized floating point arithmetic was recommended by F. L. Bauer
and K. Samelson in the article cited above, and it was independently used by
J. W. Carr III at the University of Michigan in 1953. Several years later, the
MANIAC III computer was designed to include both kinds of arithmetic in
its hardware; see R. L. Ashenhurst and N. Metropolis, JACM 6 (1959), 415-428,
IEEE Trans. EC-12 (1963), 896-901; R. L. Ashenhurst, Proc. Spring Joint
Computer Conf. 21 (1962), 195-202. See also H. L. Gray and C. Harrison, Jr.,
Proc. Eastern Joint Computer Conf. 16 (1959), 244-248, and W. G. Wadey,
JACM 7 (1960), 129-139, for further early discussions of unnormalized
arithmetic.

For early developments in interval arithmetic, and some modifications, see 
A. Gibb, CACM 4 (1961), 319-320; B. A. Chartres, JACM 13 (1966), 386-403; 
and the book Interval Analysis by Ramon E. Moore (Prentice-Hall, 1966). The 
subsequent flourishing of this subject is described in Moore’s later book, Methods 
and Applications of Interval Analysis (SIAM, 1979). 

The book Grundlagen des Numerischen Rechnens: Mathematische Begründung
der Rechenarithmetik by Ulrich Kulisch (Mannheim: Bibl. Inst., 1976) is
entirely devoted to the study of floating point arithmetic systems; see also
Kulisch’s article in IEEE Trans. C-26 (1977), 610-621, and his more recent book
written jointly with W. L. Miranker, entitled Computer Arithmetic in Theory
and Practice (New York: Academic Press, 1980).


EXERCISES 

Note: Normalized floating point arithmetic is assumed unless the contrary is specified. 

1. [M18] Prove that identity (7) is a consequence of (2) through (6). 

2. [M20] Use identities (2) through (8) to prove that $(u \oplus x) \oplus (v \oplus y) \ge u \oplus v$
whenever $x \ge 0$ and $y \ge 0$.

3. [M20] Find eight-digit floating decimal numbers $u$, $v$, and $w$ such that

$$u \otimes (v \otimes w) \ne (u \otimes v) \otimes w,$$

and such that no exponent overflow or underflow occurs during the computations.

4. [10] Is it possible to have floating point numbers $u$, $v$, and $w$ for which exponent
overflow occurs during the calculation of $u \otimes (v \otimes w)$ but not during the calculation
of $(u \otimes v) \otimes w$?

5. [M20] Is $u \oslash v = u \otimes (1 \oslash v)$ an identity, for all floating point numbers $u$ and
$v \ne 0$ such that no exponent overflow or underflow occurs?

6. [M22] Are either of the following two identities valid for all floating point
numbers $u$? (a) $0 \ominus (0 \ominus u) = u$; (b) $1 \oslash (1 \oslash u) = u$.

7. [M21] Let $u^{\textcircled{2}}$ stand for $u \otimes u$. Find floating binary numbers $u$ and $v$ such that

$$2(u^{\textcircled{2}} + v^{\textcircled{2}}) < (u \oplus v)^{\textcircled{2}}.$$

► 8. [20] Let $\epsilon = 0.0001$; which of the relations

$$u \prec v \ (\epsilon), \qquad u \sim v \ (\epsilon), \qquad u \succ v \ (\epsilon), \qquad u \approx v \ (\epsilon)$$

hold for the following pairs of base 10, excess 0, eight-digit floating point numbers?

a) $u = (1, +.31415927)$, $v = (1, +.31416000)$;

b) $u = (0, +.99997000)$, $v = (1, +.10000039)$;

c) $u = (24, +.60225200)$, $v = (27, +.00060225)$;

d) $u = (24, +.60225200)$, $v = (31, +.00000006)$;

e) $u = (24, +.60225200)$, $v = (32, +.00000000)$.

9. [M22] Prove (33), and explain why the conclusion cannot be strengthened to the
relation $u \approx w$ $(\epsilon_1 + \epsilon_2)$.

► 10. [M25] (W. M. Kahan.) A certain computer performs floating point arithmetic
without proper rounding, and, in fact, its floating point multiplication routine ignores
all but the first $p$ most significant digits of the $2p$-digit product $f_u f_v$. (Thus when
$f_u f_v < 1/b$, the least-significant digit of $u \otimes v$ always comes out to be zero, due to
subsequent normalization.) Show that this causes the monotonicity of multiplication
to fail; i.e., there are positive normalized floating point numbers $u$, $v$, $w$ such that
$u < v$ but $u \otimes w > v \otimes w$.

11. [M20] Prove Lemma T.

12. [M24] Carry out the proof of Theorem B and (46) when $|e_u - e_v| \ge p$.

► 13. [M25] Some programming languages (and even some computers) make use of
floating point arithmetic only, with no provision for exact calculations with integers. If 
operations on integers are desired, we can, of course, represent an integer as a floating 
point number; and when the floating point operations satisfy the basic definitions in 
(9), we know that all floating point operations will be exact, provided that the operands 
and the answer can each be represented exactly with p significant digits. Therefore—so 
long as we know that the numbers aren’t too large—we can add, subtract, or multiply 
integers with no inaccuracy due to rounding errors. 

But suppose that a programmer wants to determine if $m$ is an exact multiple of $n$,
when $m$ and $n \ne 0$ are integers. Suppose further that a subroutine is available to
calculate the quantity $\mathrm{round}(u \bmod 1) = u \mathbin{\textcircled{mod}} 1$ for any given
floating point number $u$, as in exercise 4.2.1-15. One good way to determine whether
or not $m$ is a multiple of $n$ might be to test whether or not
$(m \oslash n) \mathbin{\textcircled{mod}} 1 = 0$, using the assumed subroutine; but perhaps
rounding errors in the floating point calculations will invalidate this test in
certain cases.

Find suitable conditions on the range of integer values $n \ne 0$ and $m$, such that $m$
is a multiple of $n$ if and only if $(m \oslash n) \mathbin{\textcircled{mod}} 1 = 0$. In other
words, show that if $m$ and $n$ are not too large, this test is valid.

14. [M27] Find a suitable $\epsilon$ such that $(u \otimes v) \otimes w \approx u \otimes (v \otimes w)$ $(\epsilon)$, when
unnormalized multiplication is being used. (This generalizes (39), since unnormalized
multiplication is exactly the same as normalized multiplication when the input operands
$u$, $v$, and $w$ are normalized.)

► 15. [M24] (H. Björk.) Does the computed midpoint of an interval always lie between
the endpoints? (In other words, does $u \le v$ imply that $u \le (u \oplus v) \oslash 2 \le v$?)

16. [M28] (a) What is $(\cdots((x_1 \oplus x_2) \oplus x_3) \oplus \cdots \oplus x_n)$ when $n = 10^6$ and
$x_k = 1.1111111$ for all $k$, using eight-digit floating decimal arithmetic? (b) What
happens when Eq. (14) is used to calculate the standard deviation of these particular
values $x_k$? What happens when Eqs. (15) and (16) are used instead? (c) Prove that
$S_k \ge 0$ in (16), for all choices of $x_1, \ldots, x_k$.

17. [28] Write a MIX subroutine, FCMP, that compares the floating point number $u$ in
location ACC with the floating point number $v$ in register A, and that sets the comparison
indicator to LESS, EQUAL, or GREATER, according as $u \prec v$, $u \sim v$, or $u \succ v$ $(\epsilon)$;
here $\epsilon$ is stored in location EPSILON as a nonnegative fixed point quantity with the
decimal point assumed at the left of the word. Assume normalized inputs.

18. [M40] In unnormalized arithmetic is there a suitable number $\epsilon$ such that

$$u \otimes (v \oplus w) \approx (u \otimes v) \oplus (u \otimes w) \quad (\epsilon)?$$

► 19. [M30] (W. M. Kahan.) Consider the following procedure for floating point summation 
of $x_1, \ldots, x_n$: 

$$s_0 = c_0 = 0;$$

$$y_k = x_k \ominus c_{k-1}, \quad s_k = s_{k-1} \oplus y_k, \quad c_k = (s_k \ominus s_{k-1}) \ominus y_k, \quad \text{for } 1 \le k \le n.$$

Let the relative errors in these operations be defined by the equations 

$$y_k = (x_k - c_{k-1})(1 + \eta_k), \quad s_k = (s_{k-1} + y_k)(1 + \sigma_k),$$
$$c_k = \bigl((s_k - s_{k-1})(1 + \gamma_k) - y_k\bigr)(1 + \delta_k),$$

where $|\eta_k|, |\sigma_k|, |\gamma_k|, |\delta_k| \le \epsilon$. Prove that $s_n = \sum_{1 \le k \le n} (1 + \theta_k) x_k$, where $|\theta_k| \le 2\epsilon + 
O(n\epsilon^2)$. [Theorem C says that if $b = 2$ and $|s_{k-1}| \ge |y_k|$ we have $s_{k-1} + y_k = s_k - c_k$ 
exactly. But in this exercise we want to obtain an estimate that is valid even when 
floating point operations are not carefully rounded, assuming only that each operation 
has bounded relative error.] 
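
The summation procedure of exercise 19 can be tried out directly. The following is a minimal Python sketch, assuming IEEE binary double precision (Python's `float`) in place of the abstract operations $\oplus$ and $\ominus$; the name `kahan_sum` and the comparison against `math.fsum` are illustrative choices and not part of the exercise, which asks for a proof rather than an experiment.

```python
import math


def kahan_sum(xs):
    """Sum xs with the compensated procedure described in exercise 19."""
    s = 0.0  # s_k: the running sum
    c = 0.0  # c_k: running estimate of the rounding error committed so far
    for x in xs:
        y = x - c        # y_k = x_k (-) c_{k-1}
        t = s + y        # s_k = s_{k-1} (+) y_k
        c = (t - s) - y  # c_k = (s_k (-) s_{k-1}) (-) y_k
        s = t
    return s


if __name__ == "__main__":
    xs = [0.1] * 100_000
    naive = 0.0
    for x in xs:
        naive += x
    reference = math.fsum(xs)  # correctly rounded sum, used only as a yardstick
    print("naive error:      ", abs(naive - reference))
    print("compensated error:", abs(kahan_sum(xs) - reference))
```

On such inputs the compensated error is expected to stay within a few units in the last place of the sum, independent of $n$ to first order, which is what the bound $|\theta_k| \le 2\epsilon + O(n\epsilon^2)$ suggests.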

20. [25] (S. Linnainmaa.) Find all $u$, $v$ for which $|u| \ge |v|$ and (47) fails. 

21. [M35] (T. J. Dekker.) Theorem C shows how to do exact addition of floating 
binary numbers. Explain how to do exact multiplication: Express the product $uv$ in 
the form $w + w'$, where $w$ and $w'$ are computed from two given floating binary numbers 
$u$ and $v$, using only the operations $\oplus$, $\ominus$, and $\otimes$. 

22. [M30] Can drift occur in floating point multiplication/division? Consider the 
sequence $x_0 = u$, $x_{2n+1} = x_{2n} \otimes v$, $x_{2n+2} = x_{2n+1} \oslash v$, given $u$ and $v$; what is the 
largest subscript $k$ such that $x_k \ne x_{k+2}$ is possible? 

► 23. [M26] Prove or disprove: $u \ominus (u \bmod 1) = \lfloor u \rfloor$, for all floating point $u$. 

24. [M27] Consider the set of all intervals $[u_l, u_r]$, where $u_l$ and $u_r$ are either nonzero 
floating point numbers or the special symbols $+0$, $-0$, $+\infty$, $-\infty$; each interval must 
have $u_l \le u_r$, and $u_l = u_r$ is allowed only when $u_l$ is finite and nonzero. The interval 
$[u_l, u_r]$ stands for all floating point $x$ such that $u_l \le x \le u_r$, where we regard 

$$-\infty < -x < -0 < 0 < +0 < +x < +\infty$$

for all positive $x$. (Thus, $[1, 2]$ means $1 \le x \le 2$; $[+0, 1]$ means $0 < x \le 1$; $[-0, 1]$ 
means $0 \le x \le 1$; $[-0, +0]$ denotes the single value 0; and $[-\infty, +\infty]$ stands for 
everything.) Show how to define appropriate arithmetic operations on all such intervals, 
without resorting to “overflow” or “underflow” or other anomalous indications except 
when dividing by an interval that includes zero. 

► 25. [15] When people speak about inaccuracy in floating point arithmetic they often 
ascribe errors to “cancellation” that occurs during the subtraction of nearly equal 
quantities. But when $u$ and $v$ are approximately equal, the difference $u \ominus v$ is obtained 
exactly, with no error. What do these people really mean? 

26. [HM30] (H. G. Diamond.) Suppose $f(x)$ is a strictly increasing function on some 
interval $[x_0, x_1]$, and let $g(x)$ be the inverse function. (For example, $f$ and $g$ might 
be “exp” and “ln”, or “tan” and “arctan”.) If $x$ is a floating point number such that 
$x_0 \le x \le x_1$, let $\hat f(x) = \mathrm{round}(f(x))$, and if $y$ is a floating point number such that 
$f(x_0) \le y \le f(x_1)$, let $\hat g(y) = \mathrm{round}(g(y))$; furthermore, let $h(x) = \hat g(\hat f(x))$, whenever 
this is defined. Although $h(x)$ won’t always be equal to $x$, due to rounding, we expect 
$h(x)$ to be “near” $x$. 

Prove that if the precision $b^p$ is at least 3, and if $f$ is strictly concave or strictly 
convex (i.e., $f''(x) < 0$ or $f''(x) > 0$ for all $x$ in $[x_0, x_1]$), then repeated application of 
$h$ will be stable in the sense that 

$$h(h(h(x))) = h(h(x)),$$

for all $x$ such that both sides of this equation are defined. In other words, there will be 
no “drift” if the subroutines are properly implemented. 

► 27. [M25] (W. M. Kahan.) Let $f(x) = 1 + x + \cdots + x^{106} = (1 - x^{107})/(1 - x)$ 
for $x < 1$, and let $g(y) = f\bigl((\tfrac{1}{3} - y^2)(3 + 3.45y^2)\bigr)$ for $0 < y < 1$. Evaluate $g(y)$ on 
one or more “pocket calculators,” for $y = 10^{-3}$, $10^{-4}$, $10^{-5}$, $10^{-6}$, and explain all 
inaccuracies in the results you obtain. (Since most present-day calculators do not round 
correctly, the results are often surprising. Note that $g(\epsilon) = 107 - 10491.35\epsilon^2 + O(\epsilon^4)$.) 


*4.2.3. Double-Precision Calculations 

Up to now we have considered “single-precision” floating point arithmetic, which 
essentially means that the floating point values we have dealt with can be stored 
in a single machine word. When single-precision floating point arithmetic does 
not yield sufficient accuracy for a given application, the precision can be increased 
by suitable programming techniques that use two or more words of memory to 
represent each number. 

Although we shall discuss the general question of high-precision calculations 
in Section 4.3, it is appropriate to give a separate discussion of double-precision 
here. Special techniques apply to double precision that are comparatively inappropriate 
for higher precisions; and double precision is a reasonably important 
topic in its own right, since it is the first step beyond single precision and it is 
applicable to many problems that do not require extremely high precision. 

Double-precision calculations are almost always required for floating point 
rather than fixed point arithmetic, except perhaps in statistical work where fixed 
point double-precision is commonly used to calculate sums of squares and cross 
products; since fixed point versions of double-precision arithmetic are simpler 
than floating point versions, we shall confine our discussion here to the latter. 

Double precision is quite frequently desired not only to extend the precision 
of the fraction parts of floating point numbers, but also to increase the range of 
the exponent part. Thus we shall deal in this section with the following two-word 
format for double-precision floating point numbers in the MIX computer: 

    | ± | e | e | f | f | f |    |   | f | f | f | f | f |.    (1)

Here two bytes are used for the exponent and eight bytes are used for the fraction. 
The exponent is “excess $b^2/2$,” where $b$ is the byte size. The sign will appear in 
the most significant word; it is convenient to ignore the sign of the other word 
completely. 

Our discussion of double-precision arithmetic will be quite machine-oriented, 
because it is only by studying the problems involved in coding these routines 
that a person can properly appreciate the subject. A careful study of the MIX 
programs below is therefore essential to the understanding of the material. 

In this section we shall depart from the idealistic goals of accuracy stated 
in the previous two sections; our double-precision routines will not round their 
results, and a little bit of error will sometimes be allowed to creep in. Users dare 
not trust these routines too much. There was ample reason to squeeze out every 
possible drop of accuracy in the single-precision case, but now we face a different 
situation: (a) The extra programming required to ensure true double-precision 
rounding in all cases is considerable; fully accurate routines would take, say, 
twice as much space and half again as much more time. It was comparatively 
easy to make our single-precision routines perfect, but double precision brings 
us face to face with our machine’s limitations. (A similar situation occurs with 
respect to other floating point subroutines; we can’t expect the cosine routine 
to compute round(cosx) exactly for all x, since that turns out to be virtually 
impossible. Instead, the cosine routine should provide the best relative error it 
can achieve with reasonable speed, for all reasonable values of x. Of course, the 
designer of the routine should try to make the computed function satisfy simple 
mathematical laws whenever possible (e.g., $\cos(-x) = \cos x$, $|\cos x| \le 1$, 
and $\cos x \ge \cos y$ for $0 \le x \le y < \pi$).) (b) Single-precision arithmetic is 
a “staple food” that everybody who wants to employ floating point arithmetic 
must use, but double precision is usually for situations where such clean results 
aren’t as important. The difference between seven- and eight-place accuracy can 
be noticeable, but we rarely care about the difference between 15- and 16-place 
accuracy. Double precision is most often used for intermediate steps during the 
calculation of single-precision results; its full potential isn’t needed. (c) It will 
be instructive for us to analyze these procedures in order to see how inaccurate 
they can be, since they typify the types of short cuts generally taken in bad 
single-precision routines (see exercises 7 and 8). 

Let us now consider addition and subtraction operations from this stand¬ 
point. Subtraction is, of course, converted to addition by changing the sign of 
the second operand. Addition is performed by separately adding together the 
least-significant halves and the most-significant halves, propagating “carries” 
appropriately. 

A difficulty arises, however, since we are doing signed-magnitude arithmetic: 
it is possible to add the least-significant halves and to get the wrong sign (namely, 
when the signs of the operands are opposite and the least-significant half of the 
smaller operand is bigger than the least-significant half of the larger operand). 
The simplest solution is to anticipate the correct sign; so in step A2 (cf. Algorithm 
4.2.1A), we not only assume that $e_u \ge e_v$, we also assume that $|u| \ge |v|$. This 
means we can be sure that the final sign will be the sign of u. In other respects, 
double-precision addition is very much like its single-precision counterpart, only 
everything is done twice. 

Program A (Double-precision addition). The subroutine DFADD adds a double-precision 
floating point number $v$, having the form (1), to a double-precision 
floating point number u, assuming that v is initially in rAX (i.e., registers A 
and X), and that u is initially stored in locations ACC and ACCX. The answer 
appears both in rAX and in (ACC, ACCX). The subroutine DFSUB subtracts v from 
u under the same conventions. 

Both input operands are assumed to be normalized, and the answer is 
normalized. The last portion of this program is a double-precision normalization 
procedure that is used by other subroutines of this section. Exercise 5 shows how 
to improve the program significantly. 


    01  ABS     EQU   1:5           Field definition for absolute value
    02  SIGN    EQU   0:0           Field definition for sign
    03  EXPD    EQU   1:2           Double-precision exponent field
    04  DFSUB   STA   TEMP          Double-precision subtraction:
    05          LDAN  TEMP          Change sign of v.
    06  DFADD   STJ   EXITDF        Double-precision addition:
    07          CMPA  ACC(ABS)      Compare |v| with |u|.
    08          JG    1F
    09          JL    2F
    10          CMPX  ACCX(ABS)
    11          JLE   2F
    12  1H      STA   ARG           If |u| < |v|, interchange u ↔ v.
    13          STX   ARGX
    14          LDA   ACC
    15          LDX   ACCX
    16          ENT1  ACC           (ACC and ACCX are in consecutive
    17          MOVE  ARG(2)          locations.)
    18  2H      STA   TEMP          Now ACC has the sign of the answer.
    19          LD1N  TEMP(EXPD)    rI1 ← −e_v.
    20          LD2   ACC(EXPD)     rI2 ← e_u.
    21          INC1  0,2           rI1 ← e_u − e_v.
    22          SLAX  2             Remove exponent.
    23          SRAX  1,1           Scale right.
    24          STA   ARG           0 v1 v2 v3 v4
    25          STX   ARGX          v5 v6 v7 v8 v9
    26          STA   ARGX(SIGN)    Store true sign in both halves.
    27          LDA   ACC
    28          LDX   ACCX
    29          SLAX  2             Remove exponent.
    30          STA   ACC           u1 u2 u3 u4 u5
    31          SLAX  4
    32          ENTX  1
    33          STX   EXPO          EXPO ← 1 (see below).
    34          SRC   1             1 u5 u6 u7 u8
    35          STA   1F(SIGN)      A trick, see comments in text.
    36          ADD   ARGX(0:4)     Add 0 v5 v6 v7 v8.
    37          SRAX  4
    38  1H      DECA  1             Recover from inserted 1. (Sign varies)
    39          ADD   ACC(0:4)      Add most significant halves.
    40          ADD   ARG           (Overflow cannot occur)
    41  DNORM   JANZ  1F            Normalization routine:
    42          JXNZ  1F            f_w in rAX, e_w = EXPO + rI2.
    43  DZERO   STA   ACC           If f_w = 0, set e_w ← 0.
    44          JMP   9F
    45  2H      SLAX  1             Normalize to left.
    46          DEC2  1
    47  1H      CMPA  =0=(1:1)      Is the leading byte zero?
    48          JE    2B
    49          SRAX  2             (Rounding omitted)
    50          STA   ACC
    51          LDA   EXPO          Compute final exponent.
    52          INCA  0,2
    53          JAN   EXPUND        Is it negative?
    54          STA   ACC(EXPD)
    55          CMPA  =1(3:3)=      Is it more than two bytes?
    56          JL    8F
    57  EXPOVD  HLT   20
    58  EXPUND  HLT   10
    59  8H      LDA   ACC           Bring answer into rA.
    60  9H      STX   ACCX
    61  EXITDF  JMP   *             Exit from subroutine.
    62  ARG     CON   0
    63  ARGX    CON   0
    64  ACC     CON   0             Floating point accumulator
    65  ACCX    CON   0               (ACC and ACCX together)
    66  EXPO    CON   0             Part of “raw exponent” |

When the least-significant halves are added together in this program, an 
extra digit “1” is inserted at the left of the word that is known to have the 
correct sign. After the addition, this byte can be 0, 1, or 2, depending on 
the circumstances, and all three cases are handled simultaneously in this way. 
(Compare this with the rather cumbersome method of complementation that is 
used in Program 4.2.1A.) 
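
The inserted-digit trick can be illustrated outside of MIX. The following Python sketch is a loose analogue, not a translation of Program A: it assumes operands are already aligned, represents each as a pair of nonnegative integer halves, and assumes $|u| \ge |v|$ so that the result carries the sign of $u$. The function name `add_two_word` and the parameter `word` are hypothetical.

```python
def add_two_word(u_hi, u_lo, v_hi, v_lo, opposite_signs, word=10**5):
    """Return (hi, lo) for |u| + |v| or |u| - |v|, assuming |u| >= |v|.

    Each operand has magnitude hi*word + lo with 0 <= lo < word.
    Hypothetical helper: it mimics only the carry handling, not MIX's
    word layout or normalization.
    """
    s = -1 if opposite_signs else 1
    # Insert an extra "1" to the left of u's low half, so that the sum of
    # the low halves can never change sign.
    low = word + u_lo + s * v_lo
    # The inserted digit is now 0, 1, or 2: borrow, nothing, or carry.
    lead, lo = divmod(low, word)
    carry = lead - 1  # "recover from inserted 1"
    hi = u_hi + s * v_hi + carry
    return hi, lo


if __name__ == "__main__":
    # 3.00010 - 2.99999 in five-digit halves: a borrow out of the low half.
    print(add_two_word(3, 10, 2, 99999, opposite_signs=True))
```

All three outcomes for the leading digit are handled by the same `divmod`, which is the point of the trick: no separate complementation step is needed when the low half of the smaller operand exceeds that of the larger.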

It is worth noting that register A can be zero after the instruction on line 40 
has been performed; and, because of the way MIX defines the sign of a zero result, 
the accumulator contains the correct sign that is to be attached to the result if 
register X is nonzero. If lines 39 and 40 were interchanged, the program would 
be incorrect, even though both instructions are “ADD”! 

Now let us consider double-precision multiplication. The product has four 
components, shown schematically in Fig. 4. Since we need only the leftmost eight 
bytes, it is convenient to work only with the digits to the left of the vertical line 
in the diagram, and this means in particular that we need not even compute the 
product of the two least-significant halves. 

Program M (Double-precision multiplication). The input and output conventions 
for this subroutine are the same as for Program A. 

    01  BYTE    EQU   1(4:4)        Byte size
    02  QQ      EQU   BYTE*BYTE/2   Excess of double-precision exponent
    03  DFMUL   STJ   EXITDF        Double-precision multiplication:
    04          STA   TEMP
    05          SLAX  2             Remove exponent.
    06          STA   ARG           v_m
    07          STX   ARGX          v_l
    08          LDA   TEMP(EXPD)
    09          ADD   ACC(EXPD)
    10          STA   EXPO          EXPO ← e_u + e_v.
    11          ENT2  -QQ           rI2 ← −QQ.
    12          LDA   ACC
    13          LDX   ACCX
    14          SLAX  2             Remove exponent.
    15          STA   ACC           u_m
    16          STX   ACCX          u_l
    17          MUL   ARGX          u_m × v_l
    18          STA   TEMP
    19          LDA   ARG(ABS)
    20          MUL   ACCX(ABS)     |v_m × u_l|
    21          SRA   1             0 x x x x
    22          ADD   TEMP(1:4)     (Overflow cannot occur)
    23          STA   TEMP
    24          LDA   ARG
    25          MUL   ACC           v_m × u_m
    26          STA   TEMP(SIGN)    Store true sign of result.
    27          STA   ACC           Now prepare to add all the
    28          STX   ACCX            partial products together.
    29          LDA   ACCX(0:4)     0 x x x x
    30          ADD   TEMP          (Overflow cannot occur)
    31          SRAX  4
    32          ADD   ACC           (Overflow cannot occur)
    33          JMP   DNORM         Normalize and exit. |

    u u u u u u u u 0 0                              = u_m + εu_l
    v v v v v v v v 0 0                              = v_m + εv_l
    ----------------------------------------------
                      |   x x x x x x 0 0 0 0        = ε² u_l × v_l
              x x x x | x x x x 0 0                  = ε u_m × v_l
              x x x x | x x x x 0 0                  = ε u_l × v_m
    x x x x x x x x x | x                            = u_m × v_m
    ----------------------------------------------
    w w w w w w w w w | w w w w w w w 0 0 0 0

Fig. 4. Double-precision multiplication of eight-byte fraction parts. 



Note the careful treatment of signs in this program, and note also the fact 
that the range of exponents makes it impossible to compute the final exponent 
using an index register. Program M is perhaps too slipshod in accuracy, since 
it throws away all the information to the right of the vertical line in Fig. 4; this 
can make the least significant byte as much as 2 in error. A little more accuracy 
can be achieved as discussed in exercise 4. 
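
To get a feeling for how much is lost to the right of the vertical line, here is a simplified Python model. It is a sketch under stated assumptions rather than a simulation of Program M: the halves are taken as integers scaled by $b^5$, each retained partial product is truncated at the ninth byte position, and the product of the two least-significant halves is never formed. The function names and the default `b=10` are hypothetical.

```python
def truncated_product(u_m, u_l, v_m, v_l, b=10):
    """Approximate (u_m + eps*u_l)(v_m + eps*v_l), in units of b**-9.

    All four arguments are integers in [0, b**5) standing for five-byte
    fractions (value = argument / b**5), and eps = b**-5.
    """
    upper = (u_m * v_m) // b       # u_m*v_m is in units of b**-10
    cross1 = (u_m * v_l) // b**6   # eps*u_m*v_l is in units of b**-15
    cross2 = (u_l * v_m) // b**6   # eps*u_l*v_m is in units of b**-15
    return upper + cross1 + cross2  # eps**2 * u_l*v_l is never computed


def exact_product(u_m, u_l, v_m, v_l, b=10):
    """The exact product of the same operands, in units of b**-20."""
    word = b**5
    return (u_m * word + u_l) * (v_m * word + v_l)


if __name__ == "__main__":
    args = (31415, 92600, 27182, 81800)
    lost = exact_product(*args) - truncated_product(*args) * 10**11
    print("discarded amount, in units of the ninth byte:", lost / 10**11)
```

In this model each of the three truncations discards less than one unit of the ninth position and the omitted product contributes less than $1/b$ of a unit, so the total shortfall is below four such units; the text's figure for Program M itself should be taken from the analysis there, not from this sketch.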

Double-precision floating division is the most difficult routine, or at least the 
most frightening prospect we have encountered so far in this chapter. Actually, 
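it is not terribly complicated, once we see how to do it; let us write the numbers 
to be divided in the form $(u_m + \epsilon u_l)/(v_m + \epsilon v_l)$, where $\epsilon$ is the reciprocal of 
the word size of the computer, and where $v_m$ is assumed to be normalized. The 
fraction can now be expanded as follows: 

$$
\begin{aligned}
\frac{u_m + \epsilon u_l}{v_m + \epsilon v_l}
 &= \frac{u_m + \epsilon u_l}{v_m} \left( \frac{1}{1 + \epsilon (v_l/v_m)} \right) \\
 &= \frac{u_m + \epsilon u_l}{v_m} \left( 1 - \epsilon \left(\frac{v_l}{v_m}\right) + \epsilon^2 \left(\frac{v_l}{v_m}\right)^2 - \cdots \right).
\end{aligned}
\tag{2}
$$

Since $0 \le |v_l| < 1$ and $1/b \le |v_m| < 1$, we have $|v_l/v_m| < b$, and the error 
from dropping terms involving $\epsilon^2$ can be disregarded. Our method therefore is 
to compute $w_m + \epsilon w_l = (u_m + \epsilon u_l)/v_m$, and then to subtract $\epsilon$ times $w_m v_l/v_m$ 
from the result. 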
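
The method just described can be checked with exact rational arithmetic. The sketch below uses Python's `fractions.Fraction` and a decimal $\epsilon$; the function name `double_divide`, the sample operands, and the way $w_m$ is obtained (truncating $w$ to a multiple of $\epsilon$) are assumptions made for illustration, and positive operands are assumed.

```python
from fractions import Fraction


def double_divide(u_m, u_l, v_m, v_l, eps):
    """Approximate (u_m + eps*u_l) / (v_m + eps*v_l), dropping eps**2 terms."""
    w = (u_m + eps * u_l) / v_m         # w_m + eps*w_l, held exactly here
    w_m = Fraction(int(w / eps)) * eps  # leading part of w (positive operands)
    return w - eps * w_m * v_l / v_m    # subtract eps times w_m*v_l/v_m


if __name__ == "__main__":
    eps = Fraction(1, 1000)
    u_m, u_l = Fraction(271, 1000), Fraction(828, 1000)
    v_m, v_l = Fraction(314, 1000), Fraction(159, 1000)
    approx = double_divide(u_m, u_l, v_m, v_l, eps)
    exact = (u_m + eps * u_l) / (v_m + eps * v_l)
    print("approximate quotient:", float(approx))
    print("absolute error:      ", float(abs(approx - exact)))
```

Because only the leading part $w_m$ enters the correction term, the error has two sources of order $\epsilon^2$: the neglected series terms and the neglected $\epsilon w_l$ inside the correction.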

In the following program, lines 27-32 do the lower half of a double-precision 
addition, using another method for forcing the appropriate sign as an alternative 
to the trick of Program A. 

Program D (Double-precision division). This program adheres to the same conventions 
as Programs A and M. 

    01  DFDIV   STJ   EXITDF        Double-precision division:
    02          JOV   OFLO          Ensure overflow is off.
    03          STA   TEMP
    04          SLAX  2             Remove exponent.
    05          STA   ARG
    06          STX   ARGX          v_l
    07          LDA   ACC(EXPD)
    08          SUB   TEMP(EXPD)
    09          STA   EXPO          EXPO ← e_u − e_v.
    10          ENT2  QQ+1          rI2 ← QQ + 1.
    11          LDA   ACC
    12          LDX   ACCX
    13          SLAX  2             Remove exponent.
    14          SRAX  1             (Cf. Algorithm 4.2.1M)
    15          DIV   ARG           If overflow, it is detected below.
    16          STA   ACC           w_m
    17          SLAX  5             Use remainder in further division.
    18          DIV   ARG
    19          STA   ACCX          ±w_l
    20          LDA   ARGX(1:4)
    21          ENTX  0
    22          DIV   ARG(ABS)      rA ← ⌊b⁴|v_l/v_m|⌋/b⁵.
    23          JOV   DVZROD        Did division cause overflow?
    24          MUL   ACC(ABS)      rAX ← |w_m v_l/(b v_m)|, approximately.
    25          SRAX  4             Multiply by b, and save
    26          SLC   5               the leading byte in rX.
    27          SUB   ACCX(ABS)     Subtract |w_l|.
    28          DECA  1             Force minus sign.
    29          SUB   WM1
    30          JOV   *+2           If no overflow, carry one more
    31          INCX  1               to upper half.
    32          SLC   5             (Now rA ≤ 0)
    33          ADD   ACC(ABS)      rA ← |w_m| − |rA|.
    34          STA   ACC(ABS)      (Now rA ≥ 0)
    35          LDA   ACC           w_m with correct sign
    36          JMP   DNORM         Normalize and exit.
    37  DVZROD  HLT   30            Unnormalized or zero divisor
    38  1H      EQU   1(1:1)
    39  WM1     CON   1B-1,BYTE-1(1:1)   Word size minus one |


Here is a table of the approximate average computation times for these 
double-precision subroutines, compared to the single-precision subroutines that 
appear in Section 4.2.1: 



                      Single precision    Double precision
    Addition              45.5u               84u
    Subtraction           49.5u               88u
    Multiplication        48u                 109u
    Division              52u                 126.5u

For extension of the methods of this section to triple-precision floating point 
fraction parts, see Y. Ikebe, CACM 8 (1965), 175-177. 


EXERCISES 

1. [16] Try the double-precision division technique by hand, with $\epsilon = \frac{1}{1000}$, when 
dividing 180000 by 314159. (Thus, let $(u_m, u_l) = (.180, .000)$ and $(v_m, v_l) = (.314, .159)$, 
and find the quotient using the method suggested in the text following (2).) 

2. [20] Would it be a good idea to insert the instruction “ENTX 0” between lines 30 
and 31 of Program M, in order to keep unwanted information left over in register X 
from interfering with the accuracy of the results? 

3. [M20] Explain why overflow cannot occur during Program M. 

4. [22] How should Program M be changed so that extra accuracy is achieved, 
essentially by moving the vertical line in Fig. 4 over to the right one position? Specify 
all changes that are required, and determine the difference in execution time caused by 
these changes. 

► 5. [24] How should Program A be changed so that extra accuracy is achieved, essentially 
by working with a nine-byte accumulator instead of an eight-byte accumulator 
to the right of the decimal point? Specify all changes that are required, and determine 
the difference in execution time caused by these changes. 

6. [23] Assume that the double-precision subroutines of this section and the single-precision 
subroutines of Section 4.2.1 are being used in the same main program. Write a 
subroutine that converts a single-precision floating point number into double-precision 
form (1), and write another subroutine that converts a double-precision floating point 
number into single-precision form (reporting exponent overflow or underflow if the 
conversion is impossible). 

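► 7. [M30] Estimate the accuracy of the double-precision subroutines in this section, 
by finding bounds $\delta_1$, $\delta_2$, and $\delta_3$ on the relative errors 

$$\bigl|\bigl((u \oplus v) - (u + v)\bigr)/(u + v)\bigr|, \qquad \bigl|\bigl((u \otimes v) - (u \times v)\bigr)/(u \times v)\bigr|,$$

$$\bigl|\bigl((u \oslash v) - (u/v)\bigr)/(u/v)\bigr|.$$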


8. [M28] Estimate the accuracy of the “improved” double-precision subroutines of 
exercises 4 and 5, in the sense of exercise 7. 

9. [M42] T. J. Dekker [Numer. Math. 18 (1971), 224-242] has suggested an alternative 
approach to double precision, based entirely on single-precision floating binary 
calculations. For example, Theorem 4.2.2C states that $u + v = w + r$, where $w = u \oplus v$ 
and $r = (u \ominus w) \oplus v$, if $|u| \ge |v|$ and the radix is 2; here $|r| \le |w|/2^p$, so the pair 
$(w, r)$ may be considered a double-precision version of $u + v$. To add two such pairs 
$(u, u') \oplus (v, v')$, where $|u'| \le |u|/2^p$ and $|v'| \le |v|/2^p$ and $|u| \ge |v|$, Dekker suggests 
computing $u + v = w + r$ (exactly), then $s = (r \oplus v') \oplus u'$ (an approximate remainder), 
and finally returning the value $(w \oplus s, (w \ominus (w \oplus s)) \oplus s)$. 

Study the accuracy and efficiency of this approach when it is used recursively to 
produce quadruple-precision calculations. 
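
The pair arithmetic outlined in exercise 9 is easy to experiment with. The following Python sketch assumes IEEE binary double precision with round-to-nearest (Python's `float`), for which the exactness claim of Theorem 4.2.2C is expected to apply when $|u| \ge |v|$ and no overflow occurs; the names `fast_two_sum` and `dekker_add` are illustrative, and `Fraction` is used only to verify results exactly.

```python
from fractions import Fraction


def fast_two_sum(u, v):
    """Return (w, r) with w = u (+) v and u + v = w + r, assuming |u| >= |v|."""
    w = u + v
    r = (u - w) + v
    return w, r


def dekker_add(u, u1, v, v1):
    """Add the pairs (u, u1) and (v, v1), assuming |u| >= |v|."""
    w, r = fast_two_sum(u, v)  # u + v = w + r exactly
    s = (r + v1) + u1          # an approximate remainder
    hi = w + s
    lo = (w - hi) + s
    return hi, lo


if __name__ == "__main__":
    w, r = fast_two_sum(0.3, 0.1)
    print("exact identity holds:",
          Fraction(w) + Fraction(r) == Fraction(0.3) + Fraction(0.1))
    print(dekker_add(1.0, 2.0**-60, 1.0, 2.0**-61))
```

The final pair is itself produced by one more application of the same two-operation pattern, which is what makes recursive use (pairs of pairs) conceivable; how accuracy and cost behave under that recursion is the subject of the exercise.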
Table 1

    $|e_u - e_v|$    b = 2    b = 10    b = 16    b = 64
    0              0.33     0.47      0.47      0.56
    1              0.12     0.23      0.26      0.27
    2              0.09     0.11      0.10      0.04
    3              0.07     0.03      0.02      0.02
    4              0.07     0.01      0.01      0.02
    5              0.04     0.01      0.02      0.00
    over 5         0.28     0.13      0.11      0.09
    average        3.1      0.9       0.8       0.5


4.2.4. Distribution of Floating Point Numbers 

In order to analyze the average behavior of floating point arithmetic algorithms 
(and in particular to determine their average running time), we need some 
statistical information that allows us to determine how often various cases arise. 
The purpose of this section is to discuss the empirical and theoretical properties 
of the distribution of floating point numbers. 

A. Addition and subtraction routines. The execution time for a floating point 
addition or subtraction depends largely on the initial difference of exponents, and 
also on the number of normalization steps required (to the left or to the right). No 
way is known to give a good theoretical model that tells what characteristics to 
expect, but extensive empirical investigations have been made by D. W. Sweeney 
[IBM Systems J. 4 (1965), 31-42]. 

By means of a special tracing routine, Sweeney ran six “typical” large-scale 
numerical programs, selected from several different computing laboratories, and 
examined each floating addition or subtraction operation very carefully. Over 
250,000 floating point addition-subtractions were involved in gathering this data. 
About one out of every ten instructions executed by the tested programs was 
either FADD or FSUB. 

Let us consider subtraction to be addition preceded by negating the second 
operand; therefore we may give all the statistics as if we were merely doing 
addition. Sweeney’s results can be summarized as follows: 

One of the two operands to be added was found to be equal to zero about 
9 percent of the time, and this was usually the accumulator (ACC). The other 
91 percent of the cases split about equally between operands of the same or of 
opposite signs, and about equally between cases where $|u| \le |v|$ or $|v| \le |u|$. 
The computed answer was zero about 1.4 percent of the time. 

The difference between exponents had a behavior approximately given by 
the probabilities shown in Table 1, for various radices $b$. (The “over 5” line of 
that table includes essentially all of the cases when one operand was zero, but 
the “average” line does not include these cases.) 







Table 2 

EMPIRICAL DATA FOR NORMALIZATION AFTER ADDITION 

                     b = 2    b = 10    b = 16    b = 64
    Shift right 1    0.20     0.07      0.06      0.03
    No shift         0.59     0.80      0.82      0.87
    Shift left 1     0.07     0.08      0.07      0.06
    Shift left 2     0.03     0.02      0.01      0.01
    Shift left 3     0.02     0.00      0.01      0.00
    Shift left 4     0.02     0.01      0.00      0.01
    Shift left >4    0.06     0.02      0.02      0.02

When $u$ and $v$ have the same sign and are normalized, then $u + v$ either 
requires one shift to the right (for fraction overflow), or no normalization shifts 
whatever. When $u$ and $v$ have opposite signs, we have zero or more left shifts 
during the normalization. Table 2 gives the observed number of shifts required; 
the last line of that table includes all cases where the result was zero. The 
average number of left shifts per normalization was about 0.9 when $b = 2$; about 
0.2 when $b = 10$ or 16; and about 0.1 when $b = 64$. 

B. The fraction parts. Further analysis of floating point routines can be based on 
the statistical distribution of the fraction parts of randomly chosen normalized 
floating point numbers. In this case the facts are quite surprising, and there is an 
interesting theory that accounts for the unusual phenomena that are observed. 

For convenience let us temporarily assume that we are dealing with floating 
decimal (i.e., radix 10) arithmetic; modifications of the following discussion to 
any other positive integer base $b$ will be very straightforward. Suppose we are 
given a “random” positive normalized number $(e, f) = 10^e \cdot f$. Since $f$ is 
normalized, we know that its leading digit is 1, 2, 3, 4, 5, 6, 7, 8, or 9, and 
it seems natural to assume that each of these nine possible leading digits will 
occur about one-ninth of the time. But, in fact, the behavior in practice is quite 
different. For example, the leading digit tends to be equal to 1 over 30 percent 
of the time! 

One way to test the assertion just made is to take a table of physical 
constants (e.g., the speed of light, the acceleration of gravity) from some standard 
reference. If we look at the Handbook of Mathematical Functions (U.S. Dept 
of Commerce, 1964), for example, we find that 8 of the 28 different physical 
constants given in Table 2.3, roughly 29 percent, have leading digit equal to 1. 
The decimal values of $n!$ for $1 \le n \le 100$ include exactly 30 entries beginning 
with 1; so do the decimal values of $2^n$ and of $F_n$, for $1 \le n \le 100$. We might 
also try looking at census reports, or a Farmer’s Almanack (but not a telephone 
directory). 
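
Counts of this kind are easy to reproduce. The following Python sketch tallies leading digits for the three integer sequences just mentioned; the helper names are hypothetical, and the Fibonacci numbers are assumed to be indexed with $F_1 = F_2 = 1$.

```python
from math import factorial


def leading_digit(n):
    """Leading decimal digit of a positive integer."""
    return int(str(n)[0])


def count_leading_ones(values):
    """How many of the given positive integers begin with the digit 1."""
    return sum(1 for v in values if leading_digit(v) == 1)


def fibonacci(limit):
    """The list F_1, ..., F_limit, assuming F_1 = F_2 = 1."""
    a, b = 1, 1
    out = []
    for _ in range(limit):
        out.append(a)
        a, b = b, a + b
    return out


powers = [2**n for n in range(1, 101)]
fibs = fibonacci(100)
facts = [factorial(n) for n in range(1, 101)]

if __name__ == "__main__":
    for name, seq in (("2^n", powers), ("F_n", fibs), ("n!", facts)):
        print(name, count_leading_ones(seq), "of", len(seq), "begin with 1")
```

For the powers of 2 the count can also be seen without computing anything: $2^n$ begins with 1 exactly when it has one more digit than $2^{n-1}$, and $2^{100}$ has 31 digits.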

In the days before pocket calculators, the pages in well-used tables of logarithms 
tended to get quite dirty in the front, while the last pages stayed relatively 
clean and neat. This phenomenon was apparently first mentioned in print by the 
American astronomer Simon Newcomb [Amer. J. Math. 4 (1881), 39-40], who 
gave good grounds for believing that the leading digit $d$ occurs with probability 
$\log_{10}(1 + 1/d)$. The same distribution was discovered empirically, many years 
later, by Frank Benford, who reported the results of 20,229 observations taken 
from different sources [Proc. Amer. Philosophical Soc. 78 (1938), 551-572]. 

In order to account for this leading-digit law, let’s take a closer look at 
the way we write numbers in floating point notation. If we take any positive 
number $u$, its leading digits are determined by the value $(\log_{10} u) \bmod 1$: The 
leading digit is less than $d$ if and only if 

$$(\log_{10} u) \bmod 1 < \log_{10} d, \tag{1}$$

since $10 f_u = 10^{(\log_{10} u) \bmod 1}$. 

Now if we have a “random” positive number $U$, chosen from some reasonable 
distribution that might occur in nature, we might expect that $(\log_{10} U) \bmod 1$ 
would be uniformly distributed between zero and one, at least to a very good 
approximation. (Similarly, we expect $U \bmod 1$, $U^2 \bmod 1$, $\sqrt{U + \pi} \bmod 1$, etc., 
to be uniformly distributed. We expect a roulette wheel to be unbiased, for 
essentially the same reason.) Therefore by (1) the leading digit will be 1 with 
probability $\log_{10} 2 \approx 30.103$ percent; it will be 2 with probability $\log_{10} 3 - 
\log_{10} 2 \approx 17.609$ percent; and, in general, if $r$ is any real value between 1 and 
10, we ought to have $10 f_U \le r$ approximately $\log_{10} r$ of the time. 
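
The probabilities implied by this argument can be tabulated in a few lines. The Python sketch below evaluates $\log_b(1 + 1/d)$ for each possible leading digit; the function name and the optional `base` parameter are illustrative assumptions.

```python
import math


def leading_digit_probability(d, base=10):
    """Probability log_base(1 + 1/d) that the leading digit equals d."""
    return math.log(1 + 1 / d, base)


if __name__ == "__main__":
    for d in range(1, 10):
        print(d, f"{100 * leading_digit_probability(d):.3f} percent")
```

The nine values telescope to $\log_{10} 10 = 1$, as a probability distribution must.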

Another way to explain this law is to say that a random value U should 
appear at a random point on a slide rule, according to the uniform distribution, 
since the distance from the left end of a slide rule to the position of $U$ is proportional 
to $(\log_{10} U) \bmod 1$. The analogy between slide rules and floating point 
calculation is very close when multiplication and division are being considered. 

The fact that leading digits tend to be small is important to keep in mind; 
it makes the most obvious techniques of “average error” estimation for floating 
point calculations invalid. The relative error due to rounding is usually a little 
more than expected. 

Of course, it may justly be said that the heuristic argument above does not 
prove the stated law. It merely shows us a plausible reason why the leading digits 
behave the way they do. An interesting approach to the analysis of leading digits 
has been suggested by R. Hamming: Let $p(r)$ be the probability that $10 f_U \le r$, 
where $1 \le r \le 10$ and $f_U$ is the normalized fraction part of a random normalized 
floating point number U. If we think of random quantities in the real world, we 
observe that they are measured in terms of arbitrary units; and if we were to 
change the definition of a meter or a gram, many of the fundamental physical 
constants would have different values. Suppose then that all of the numbers in the 
universe are suddenly multiplied by a constant factor c; our universe of random 
floating point quantities should be essentially unchanged by this transformation, 
so p(r) should not be affected. 

Multiplying everything by $c$ has the effect of transforming $(\log_{10} U) \bmod 1$ 
into $(\log_{10} U + \log_{10} c) \bmod 1$. It is now time to set up formulas that describe 
the desired behavior; we may assume that $1 \le c \le 10$. By definition, 

$$p(r) = \text{probability that } (\log_{10} U) \bmod 1 \le \log_{10} r.$$

By our assumption, we should also have 

$$
\begin{aligned}
p(r) &= \text{probability that } (\log_{10} U + \log_{10} c) \bmod 1 \le \log_{10} r \\
&= \begin{cases}
\text{probability that } (\log_{10} U \bmod 1) \le \log_{10} r - \log_{10} c \\
\qquad \text{or } (\log_{10} U \bmod 1) \ge 1 - \log_{10} c, & \text{if } c \le r; \\
\text{probability that } (\log_{10} U \bmod 1) \le \log_{10} r + 1 - \log_{10} c \\
\qquad \text{and } (\log_{10} U \bmod 1) \ge 1 - \log_{10} c, & \text{if } c \ge r;
\end{cases} \\
&= \begin{cases}
p(r/c) + 1 - p(10/c), & \text{if } c \le r; \\
p(10r/c) - p(10/c), & \text{if } c \ge r.
\end{cases}
\end{aligned}
\tag{2}
$$


Let us now extend the function $p(r)$ to values outside the range $1 \le r \le 10$, 
by defining $p(10^n r) = p(r) + n$; then if we replace $10/c$ by $d$, the last equation 
of (2) may be written 

$$p(rd) = p(r) + p(d). \tag{3}$$

If our assumption about invariance of the distribution under multiplication by a 
constant factor is valid, then Eq. (3) must hold for all $r > 0$ and $1 \le d \le 10$. 
The facts that $p(1) = 0$, $p(10) = 1$ now imply that 

$$1 = p(10) = p\bigl((\sqrt[n]{10})^n\bigr) = p(\sqrt[n]{10}) + p\bigl((\sqrt[n]{10})^{n-1}\bigr) = \cdots = n\,p(\sqrt[n]{10});$$

hence we deduce that $p(10^{m/n}) = m/n$ for all positive integers $m$ and $n$. If 
we now decide to require that $p$ is continuous, we are forced to conclude that 
$p(r) = \log_{10} r$, and this is the desired law. 
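
As a sanity check on the derivation, one can confirm numerically that $p(r) = \log_{10} r$ is left unchanged by the transformation in Eq. (2), while an apparently natural alternative such as the uniform law $p(r) = (r - 1)/9$ is not. The Python sketch below does this on a grid of values of $r$ and $c$; the function names and the grid are assumptions chosen for illustration.

```python
import math


def rescaled(p, r, c):
    """Last form of Eq. (2): the distribution obtained after scaling by c."""
    if c <= r:
        return p(r / c) + 1 - p(10 / c)
    return p(10 * r / c) - p(10 / c)


def max_violation(p, steps=90):
    """Largest |rescaled(p, r, c) - p(r)| over a grid with 1 <= r, c <= 10."""
    grid = [1 + 9 * i / steps for i in range(steps + 1)]
    return max(abs(rescaled(p, r, c) - p(r)) for r in grid for c in grid)


def uniform_law(r):
    """The hypothetical law in which 10*f_U is uniform on [1, 10]."""
    return (r - 1) / 9


if __name__ == "__main__":
    print("logarithmic law:", max_violation(math.log10))
    print("uniform law:    ", max_violation(uniform_law))
```

The check does not prove uniqueness, of course; that is what the continuity argument above supplies.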

Although this argument may be more convincing than the first one, it doesn’t 
really hold up under scrutiny if we stick to conventional notions of probability. 
The traditional way to make the above argument rigorous is to assume that 
there is some underlying distribution of numbers $F(u)$ such that a given positive 
number $U$ is $\le u$ with probability $F(u)$; then the probability of concern to us is 

$$p(r) = \sum_m \bigl(F(10^m r) - F(10^m)\bigr), \tag{4}$$

summed over all values $-\infty < m < \infty$. Our assumptions about scale invariance 
and continuity have led us to conclude that 

$$p(r) = \log_{10} r.$$

Using the same argument, we could “prove” that 

$$\sum_m \bigl(F(b^m r) - F(b^m)\bigr) = \log_b r, \tag{5}$$

for each integer $b \ge 2$, when $1 \le r \le b$. But there is no distribution function $F$ 
that satisfies this equation for all such $b$ and $r$! (See exercise 7.) 






One way out of the difficulty is to regard the logarithm law p(r) = log 10 r as 
only a very close approximation to the true distribution. The true distribution 
itself may perhaps be changing as the universe expands, becoming a better and 
better approximation as time goes on; and if we replace 10 by an arbitrary 
base b, the approximation might be less accurate (at any given time) as b gets 
larger. Another rather appealing way to resolve the dilemma, by abandoning the 
traditional idea of a distribution function, has been suggested by R. A. Raimi, 
AMM 76 (1969), 342-348. 

The hedging in the last paragraph is probably a very unsatisfactory explanation,
and so the following further calculation (which sticks to rigorous mathematics
and avoids any intuitive, yet paradoxical, notions of probability) should
be welcome. Let us consider the distribution of the leading digits of the positive 
integers, instead of the distribution for some imagined set of real numbers. The 
investigation of this topic is quite interesting, not only because it sheds some 
light on the probability distributions of floating point data, but also because 
it makes a particularly instructive example of how to combine the methods of 
discrete mathematics with the methods of infinitesimal calculus. 

In the following discussion, let $r$ be a fixed real number, $1 \le r \le 10$;
we will attempt to make a reasonable definition of $p(r)$, the “probability” that
the representation $10^{e_N} \cdot f_N$ of a “random” positive integer $N$ has $10 f_N < r$,
assuming infinite precision.

To start, let us try to find the probability using a limiting method like the 
definition of “Pr” in Section 3.5. One nice way to rephrase that definition is to 
define 

$$P_0(n) = \begin{cases}
1, & \text{if } n = 10^e \cdot f \text{ where } 10f < r, \text{ i.e., if } (\log_{10} n) \bmod 1 < \log_{10} r; \\
0, & \text{otherwise.}
\end{cases} \tag{6}$$

Now $P_0(1)$, $P_0(2)$, ... is an infinite sequence of zeros and ones, with ones to
represent the cases that contribute to the probability we are seeking. We can
try to “average out” this sequence, by defining

$$P_1(n) = \frac{1}{n} \sum_{1 \le k \le n} P_0(k). \tag{7}$$

Thus if we generate a random integer between 1 and $n$ using the techniques of
Chapter 3, and convert it to floating decimal form $(e, f)$, the probability that
$10f < r$ is exactly $P_1(n)$. It is natural to let $\lim_{n\to\infty} P_1(n)$ be the “probability”
$p(r)$ we are after, and that is just what we did in Section 3.5.

But in this case the limit does not exist: For example, let us consider the
subsequence

$$P_1(s),\; P_1(10s),\; P_1(100s),\; \ldots,\; P_1(10^n s),$$

where $s$ is a real number, $1 \le s \le 10$. If $s \le r$, we find that

$$\begin{aligned}
P_1(10^n s) &= \frac{1}{\lfloor 10^n s \rfloor}\Bigl(\lceil r \rceil - 1 + \lceil 10r \rceil - 10 + \cdots + \lceil 10^{n-1} r \rceil - 10^{n-1} + \lfloor 10^n s \rfloor + 1 - 10^n\Bigr) \\
&= \frac{1}{\lfloor 10^n s \rfloor}\Bigl(r(1 + 10 + \cdots + 10^{n-1}) + O(n) + \lfloor 10^n s \rfloor - 1 - 10 - \cdots - 10^n\Bigr) \\
&= \frac{1}{\lfloor 10^n s \rfloor}\Bigl(\tfrac{1}{9}(10^n r - 10^{n+1}) + \lfloor 10^n s \rfloor + O(n)\Bigr).
\end{aligned} \tag{8}$$

As $n \to \infty$, $P_1(10^n s)$ therefore approaches the limiting value $1 + (r - 10)/9s$.
The above calculation for the case $s \le r$ can be modified so that it is valid for
$s > r$ if we replace $\lfloor 10^n s \rfloor + 1$ by $\lceil 10^n r \rceil$; when $s > r$, we therefore obtain
the limiting value $10(r - 1)/9s$. [See J. Franel, Naturforschende Gesellschaft,
Vierteljahrsschrift 62 (Zurich, 1917), 286-295.]

In other words, the sequence $\langle P_1(n) \rangle$ has subsequences $\langle P_1(10^n s) \rangle$ whose limit
goes from $(r - 1)/9$ up to $10(r - 1)/9r$ and down again to $(r - 1)/9$, as $s$ goes
from 1 to $r$ to 10. We see that $P_1(n)$ has no limit as $n \to \infty$; and the values of
$P_1(n)$ for large $n$ are not particularly good approximations to our conjectured
limit $\log_{10} r$ either!
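
The oscillation can be observed numerically. The following Python sketch is only an illustration, not part of the original argument; the names `leading_part_below`, `p1`, and `limit_p1` are hypothetical, and the choices r = 2 and n = 4 are assumptions made for the example. It counts by brute force the integers k ≤ N whose leading part is below r, and compares the result with the two limiting values 1 + (r − 10)/9s and 10(r − 1)/9s.

```python
def leading_part_below(k, r):
    """True when 10*f_k < r, where k = 10^e * f_k and 1/10 <= f_k < 1."""
    return k < r * 10 ** (len(str(k)) - 1)


def p1(n, r):
    """P_1(n): the fraction of integers 1..n whose leading part is below r."""
    return sum(leading_part_below(k, r) for k in range(1, n + 1)) / n


def limit_p1(s, r):
    """Limiting value of P_1(10^n s) as n grows, using the two cases in the text."""
    if s <= r:
        return 1 + (r - 10) / (9 * s)
    return 10 * (r - 1) / (9 * s)


if __name__ == "__main__":
    r = 2.0
    for s in (1, 2, 5):
        print(s, p1(s * 10**4, r), limit_p1(s, r))
```

With r = 2 the three printed pairs should lie near 1/9, 5/9, and 2/9 respectively, so the averages keep swinging by a factor of five no matter how large n becomes.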

Since $P_1(n)$ doesn’t approach a limit, we can try to use the same idea as (7)
once again, to “average out” the anomalous behavior. In general, let

$$P_{m+1}(n) = \frac{1}{n} \sum_{1 \le k \le n} P_m(k). \tag{9}$$

Then $P_{m+1}(n)$ will tend to be a more well-behaved sequence than $P_m(n)$. Let us
try to confirm this with quantitative calculations; our experience with the special
case $m = 0$ indicates that it might be worthwhile to consider the subsequence
$P_{m+1}(10^n s)$. The following results can, in fact, be derived:

Lemma Q. For any integer $m \ge 1$ and any real number $\epsilon > 0$, there are
functions $Q_m(s)$, $R_m(s)$ and an integer $N_m(\epsilon)$, such that whenever $n > N_m(\epsilon)$
and $1 \le s \le 10$, we have

$$\begin{aligned}
|P_m(10^n s) - Q_m(s)| &< \epsilon, && \text{if } s \le r; \\
|P_m(10^n s) - (Q_m(s) + R_m(s))| &< \epsilon, && \text{if } s > r.
\end{aligned} \tag{10}$$

Furthermore the functions $Q_m(s)$ and $R_m(s)$ satisfy the relations

$$\begin{aligned}
Q_m(s) &= \frac{1}{s}\left(\frac{1}{9}\int_1^{10} Q_{m-1}(t)\,dt + \int_1^s Q_{m-1}(t)\,dt + \frac{1}{9}\int_r^{10} R_{m-1}(t)\,dt\right); \\
R_m(s) &= \frac{1}{s}\int_r^s R_{m-1}(t)\,dt; \\
Q_0(s) &= 1, \qquad R_0(s) = -1.
\end{aligned} \tag{11}$$

Proof. Consider the functions $Q_m(s)$ and $R_m(s)$ defined by (11), and let

$$S_m(t) = \begin{cases} Q_m(t), & t \le r; \\ Q_m(t) + R_m(t), & t > r. \end{cases}$$

We will prove the lemma by induction on $m$.

First note that $Q_1(s) = \bigl(1 + (s - 1) + (r - 10)/9\bigr)/s = 1 + (r - 10)/9s$,
and $R_1(s) = (r - s)/s$. From (8) we find that $|P_1(10^n s) - S_1(s)| = O(n)/10^n$;
this establishes the lemma when $m = 1$.

Now for $m > 1$, we have

$$P_m(10^n s) = \frac{1}{s}\left(\sum_{0 \le j < n} \frac{1}{10^{n-j}} \sum_{10^j \le k < 10^{j+1}} \frac{1}{10^j} P_{m-1}(k) + \sum_{10^n \le k \le 10^n s} \frac{1}{10^n} P_{m-1}(k)\right),$$

and we want to approximate this quantity. By induction, the difference

$$\left|\sum_{10^j \le k \le 10^j q} \frac{1}{10^j} P_{m-1}(k) - \sum_{10^j \le k \le 10^j q} \frac{1}{10^j} S_{m-1}\!\left(\frac{k}{10^j}\right)\right| \tag{13}$$

is less than $q\epsilon$ when $1 \le q \le 10$ and $j > N_{m-1}(\epsilon)$. Since $S_{m-1}(t)$ is continuous,
it is a Riemann-integrable function; and the difference

$$\left|\sum_{10^j \le k \le 10^j q} \frac{1}{10^j} S_{m-1}\!\left(\frac{k}{10^j}\right) - \int_1^q S_{m-1}(t)\,dt\right| \tag{14}$$

is less than $\epsilon$ for all $j$ greater than some number $N$, independent of $q$, by the
definition of integration. We may choose $N$ to be $> N_{m-1}(\epsilon)$. Therefore for
$n > N$, the difference

$$\left|P_m(10^n s) - \frac{1}{s}\left(\sum_{0 \le j < n} \frac{1}{10^{n-j}} \int_1^{10} S_{m-1}(t)\,dt + \int_1^s S_{m-1}(t)\,dt\right)\right| \tag{15}$$

is bounded by $\sum_{0 \le j \le N} (M/10^{n-j}) + \sum_{N < j < n} (11\epsilon/10^{n-j}) + 11\epsilon$, if $M$ is an
upper bound for (13) + (14) that is valid for all positive integers $j$. Finally, the
sum $\sum_{0 \le j < n} (1/10^{n-j})$, which appears in (15), is equal to $(1 - 1/10^n)/9$; so

$$\left|P_m(10^n s) - \frac{1}{s}\left(\frac{1}{9}\int_1^{10} S_{m-1}(t)\,dt + \int_1^s S_{m-1}(t)\,dt\right)\right|$$

can be made smaller than, say, $20\epsilon$, if $n$ is taken large enough. Comparing this
with (10) and (11) completes the proof. |

Fig. 5. The probability that the leading digit is 1.


The gist of Lemma Q is that we have the limiting relationship

$$\lim_{n \to \infty} P_m(10^n s) = S_m(s). \tag{16}$$

Also, since $S_m(s)$ is not constant as $s$ varies, the limit

$$\lim_{n \to \infty} P_m(n)$$

(which would be our desired “probability”) does not exist for any $m$. The
situation is shown in Fig. 5, which shows the values of $S_m(s)$ when $m$ is small
and $r = 2$.

Even though $S_m(s)$ is not a constant, so that we do not have a definite limit
for $P_m(n)$, note that already for $m = 3$ in Fig. 5 the value of $S_m(s)$ stays very
close to $\log_{10} 2 = 0.30103\ldots$. Therefore we have good reason to suspect that
$S_m(s)$ is very close to $\log_{10} r$ for all large $m$, and, in fact, that the sequence of
functions $\langle S_m(s) \rangle$ converges uniformly to the constant function $\log_{10} r$.
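
A small experiment makes this suspicion concrete. The sketch below is an illustration under stated assumptions, not the text's own calculation: it takes r = 2, cuts the integers off at 10^5, and applies the averaging rule (9) three times; the names `iterated_averages` and `spread` are hypothetical. It then measures how far each averaged sequence still swings over the last full decade.

```python
import math
from itertools import accumulate


def iterated_averages(limit, r, levels):
    """Return [P_0, P_1, ..., P_levels]; each is a list indexed 1..limit (index 0 unused)."""
    p0 = [0.0] * (limit + 1)
    for k in range(1, limit + 1):
        # P_0(k) = 1 exactly when the leading part 10*f_k is below r.
        if k < r * 10 ** (len(str(k)) - 1):
            p0[k] = 1.0
    tables = [p0]
    for _ in range(levels):
        prev = tables[-1]
        sums = accumulate(prev[1:])  # running sums P_m(1) + ... + P_m(k)
        tables.append([0.0] + [total / k for k, total in enumerate(sums, start=1)])
    return tables


def spread(values):
    """Difference between the largest and smallest value in a sequence."""
    return max(values) - min(values)


if __name__ == "__main__":
    tables = iterated_averages(10**5, 2.0, 3)
    for m in (1, 2, 3):
        decade = tables[m][10**4:]  # n = 10^4 .. 10^5, i.e. s running from 1 to 10
        print(m, round(min(decade), 4), round(max(decade), 4), round(spread(decade), 4))
    print("log10(2) =", round(math.log10(2), 5))
```

Under these assumptions the spread should shrink at each level of averaging, with the m = 3 values already clustered near 0.30103, in agreement with the behavior described for Fig. 5.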

It is interesting to prove this conjecture by explicitly calculating $Q_m(s)$ and
$R_m(s)$ for all $m$, as in the proof of the following theorem:

Theorem F. Let $S_m(s)$ be the limit defined in (16). For all $\epsilon > 0$, there exists
a number $N(\epsilon)$ such that

$$|S_m(s) - \log_{10} r| < \epsilon, \qquad \text{for } 1 \le s \le 10, \tag{17}$$

whenever $m > N(\epsilon)$.

Proof. In view of Lemma Q, we can prove this result if we can show that there
is a number $M$ depending on $\epsilon$ such that, for $1 \le s \le 10$ and for all $m > M$,
we have

$$|Q_m(s) - \log_{10} r| < \epsilon \qquad \text{and} \qquad |R_m(s)| < \epsilon. \tag{18}$$

It is not difficult to solve the recurrence formula (11) for $R_m$: We have
$R_0(s) = -1$, $R_1(s) = -1 + r/s$, $R_2(s) = -1 + (r/s)\bigl(1 + \ln(s/r)\bigr)$, and in
general

$$R_m(s) = -1 + \frac{r}{s}\left(1 + \frac{1}{1!}\ln\frac{s}{r} + \cdots + \frac{1}{(m-1)!}\Bigl(\ln\frac{s}{r}\Bigr)^{m-1}\right). \tag{19}$$

For the stated range of $s$, this converges uniformly to

$$-1 + (r/s)\exp(\ln(s/r)) = 0.$$

The recurrence (11) for $Q_m$ takes the form

$$Q_m(s) = \frac{1}{s}\left(c_m + 1 + \int_1^s Q_{m-1}(t)\,dt\right), \tag{20}$$

where

$$c_m = \frac{1}{9}\left(\int_1^{10} Q_{m-1}(t)\,dt + \int_r^{10} R_{m-1}(t)\,dt\right) - 1. \tag{21}$$

The solution to recurrence (20) is easily found by trying out the first few cases
and guessing at a formula that can be proved by induction; we find that

$$Q_m(s) = 1 + \frac{1}{s}\left(c_m + \frac{1}{1!}c_{m-1}\ln s + \cdots + \frac{1}{(m-1)!}c_1(\ln s)^{m-1}\right). \tag{22}$$

It remains for us to calculate the coefficients $c_m$, which by (19), (21), and
(22) satisfy the relations

$$\begin{aligned}
c_1 &= (r - 10)/9; \\
c_{m+1} &= \frac{1}{9}\left(c_m \ln 10 + \frac{1}{2!}c_{m-1}(\ln 10)^2 + \cdots + \frac{1}{m!}c_1(\ln 10)^m
 + r\Bigl(1 + \frac{1}{1!}\ln\frac{10}{r} + \cdots + \frac{1}{m!}\Bigl(\ln\frac{10}{r}\Bigr)^m\Bigr) - 10\right).
\end{aligned} \tag{23}$$

This sequence appears at first to be very complicated, but actually we can analyze
it without difficulty with the help of generating functions. Let

$$C(z) = c_1 z + c_2 z^2 + c_3 z^3 + \cdots;$$

then since $10^z = 1 + z\ln 10 + (1/2!)(z \ln 10)^2 + \cdots$, we deduce that

$$\begin{aligned}
c_{m+1} &= \frac{1}{10}c_{m+1} + \frac{9}{10}c_{m+1} \\
&= \frac{1}{10}\left(c_{m+1} + c_m \ln 10 + \cdots + \frac{1}{m!}c_1(\ln 10)^m\right)
 + \frac{r}{10}\left(1 + \cdots + \frac{1}{m!}\Bigl(\ln\frac{10}{r}\Bigr)^m\right) - 1
\end{aligned}$$

is the coefficient of $z^{m+1}$ in the function

$$\frac{1}{10}C(z)\,10^z + \frac{r z}{10}\,\frac{(10/r)^z}{1 - z} - \frac{z}{1 - z}. \tag{24}$$

This condition holds for all values of $m$, so (24) must equal $C(z)$, and we obtain
the explicit formula

$$C(z) = \frac{-z}{1 - z}\left(\frac{(10/r)^{z-1} - 1}{10^{z-1} - 1}\right). \tag{25}$$

We want to study asymptotic properties of the coefficients of $C(z)$, to complete
our analysis. The large parenthesized factor in (25) approaches $\ln(10/r)/\ln 10 =
1 - \log_{10} r$ as $z \to 1$, so we see that

$$C(z) + \frac{1 - \log_{10} r}{1 - z} = R(z) \tag{26}$$

is an analytic function of the complex variable $z$ in the circle

$$|z| < \left|1 + \frac{2\pi i}{\ln 10}\right|.$$

In particular, $R(z)$ converges for $z = 1$, so its coefficients approach zero. This
proves that the coefficients of $C(z)$ behave like those of $(\log_{10} r - 1)/(1 - z)$,
that is,

$$\lim_{m \to \infty} c_m = \log_{10} r - 1.$$

Finally, we may combine this with (22), to show that $Q_m(s)$ approaches

$$1 + \frac{\log_{10} r - 1}{s}\left(1 + \ln s + \frac{1}{2!}(\ln s)^2 + \cdots\right) = \log_{10} r$$

uniformly for $1 \le s \le 10$. |
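
The convergence of the coefficients can be checked numerically. The sketch below is a hedged illustration: it assumes the recurrence for the coefficients reads c_1 = (r − 10)/9 and 9c_{m+1} = c_m ln 10 + c_{m−1}(ln 10)^2/2! + ... + c_1(ln 10)^m/m! + r(1 + ln(10/r)/1! + ... + (ln(10/r))^m/m!) − 10, which is what (19), (21), and (22) yield when the integrals are carried out. The function name `c_coefficients` and the choice r = 2 are assumptions for the example.

```python
import math


def c_coefficients(r, count):
    """Return [unused, c_1, ..., c_count] computed from the recurrence for c_m."""
    ln10 = math.log(10.0)
    lnq = math.log(10.0 / r)
    c = [0.0, (r - 10.0) / 9.0]  # c[0] is a placeholder so that c[m] is c_m
    for m in range(1, count):
        total = sum(c[m + 1 - k] * ln10 ** k / math.factorial(k) for k in range(1, m + 1))
        total += r * sum(lnq ** k / math.factorial(k) for k in range(0, m + 1))
        c.append((total - 10.0) / 9.0)
    return c


if __name__ == "__main__":
    r = 2.0
    c = c_coefficients(r, 30)
    target = math.log10(r) - 1.0
    for m in (1, 2, 3, 5, 10, 30):
        print(m, c[m], c[m] - target)
```

For r = 2 the differences printed in the last column should shrink rapidly toward zero, which is the behavior the generating-function argument predicts.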

Therefore we have established the logarithmic law for integers by direct 
calculation, at the same time seeing that it is an extremely good approximation 
to the average behavior although it is never precisely achieved. 

The above proofs of Lemma Q and Theorem F are slight simplifications and 
amplifications of methods due to B. J. Flehinger, AMM 73 (1966), 1056-1061. 
Many authors have written about the distribution of initial digits, showing that 
the logarithmic law is a good approximation for many underlying distributions; 
see the survey by Ralph A. Raimi, AMM 83 (1976), 521-538, for a comprehensive 
review of the literature. Another interesting (and different) treatment of floating 
point distribution has been given by Alan G. Konheim, Math. Comp. 19 (1965), 
143-144. 

Exercise 17 discusses an approach to the definition of probability under
which the logarithmic law holds exactly, over the integers. Furthermore,
exercise 18 demonstrates that any reasonable definition of probability over the
integers must lead to the logarithmic law, if it assigns a value to the probability
of leading digits.

EXERCISES

1. [13] Given that $u$ and $v$ are nonzero floating decimal numbers with the same sign,
what is the approximate probability that fraction overflow occurs during the calculation
of $u \oplus v$, according to Tables 1 and 2?

2. [42] Make further tests of floating point addition and subtraction, to confirm or 
improve on the accuracy of Tables 1 and 2. 

3. [15] What is the probability that the two leading digits of a floating decimal 
number are “23”, according to the logarithmic law? 

4. [M18] The text points out that the front pages of a well-used table of logarithms
get dirtier than the back pages do. What if we had an antilogarithm table instead, i.e.,
a table giving the value of $x$ when $\log_{10} x$ is given; which pages of such a table would
be the dirtiest?

► 5. [M20] Let $U$ be a random real number that is uniformly distributed in the interval
$0 < U < 1$. What is the distribution of the leading digits of $U$?

6. [28] If we have binary computer words containing $n + 1$ bits, we might use $p$
bits for the fraction part of floating binary numbers, one bit for the sign, and $n - p$
bits for the exponent. This means that the range of values representable, i.e., the
ratio of the largest positive normalized value to the smallest, is essentially $2^{2^{n-p}}$. The
same computer word could be used to represent floating hexadecimal numbers, i.e.,
floating point numbers with radix 16, with $p + 2$ bits for the fraction part ($(p + 2)/4$
hexadecimal digits) and $n - p - 2$ bits for the exponent; then the range of values
would be $16^{2^{n-p-2}} = 2^{2^{n-p}}$, the same as before, and with more bits in the fraction
part. This may sound as if we are getting something for nothing, but the normalization
condition for base 16 is weaker in that there may be up to three leading zero bits in
the fraction part; thus not all of the $p + 2$ bits are “significant.”

On the basis of the logarithmic law, what are the probabilities that the fraction 
part of a positive normalized radix 16 floating point number has exactly 0, 1, 2, and 3 
leading zero bits? Discuss the desirability of hexadecimal versus binary. 

7. [HM28] Prove that there is no distribution function $F(u)$ that satisfies (5) for
each integer $b \ge 2$, and for all real values $r$ in the range $1 \le r \le b$.

8. [HM23] Does (10) hold when $m = 0$ for suitable $N_0(\epsilon)$?

9. [HM24] (P. Diaconis.) Let $P_1(n)$, $P_2(n)$, ... be any sequence of functions defined
by repeatedly averaging a given function $P_0(n)$ according to Eq. (9). Prove that
$\lim_{m\to\infty} P_m(n) = P_0(1)$ for all fixed $n$.

► 10. [HM28] The text shows that $c_m = \log_{10} r - 1 + \epsilon_m$, where $\epsilon_m$ approaches zero
as $m \to \infty$. Obtain the next term in the asymptotic expansion of $c_m$.

11. [M15] Given that $U$ is a random variable distributed according to the logarithmic
law, prove that $1/U$ is also.

12. [HM25] (R. W. Hamming.) The purpose of this exercise is to show that the result
of floating point multiplication tends to obey the logarithmic law more perfectly than
the operands do. Let $U$ and $V$ be random, normalized, positive floating point numbers,
whose fraction parts are independently distributed with the respective density functions
$f(x)$ and $g(x)$. Thus, $f_u \le r$ and $f_v \le s$ with probability $\int_{1/b}^{r}\int_{1/b}^{s} f(x)g(y)\,dy\,dx$, for
$1/b \le r, s \le 1$. Let $h(x)$ be the density function of the fraction part of $U \times V$
(unrounded). Define the abnormality $A(f)$ of a density function $f$ to be the maximum
relative error,

$$A(f) = \max_{1/b \le x \le 1} \left|\frac{f(x) - l(x)}{l(x)}\right|,$$

where $l(x) = 1/(x \ln b)$ is the density of the logarithmic distribution.

Prove that $A(h) \le \min\bigl(A(f), A(g)\bigr)$. (In particular, if either factor has logarithmic
distribution the product does also.)

► 13. [M20] The floating point multiplication routine, Algorithm 4.2.1M, requires zero
or one left shifts during normalization, depending on whether $f_u f_v \ge 1/b$ or not.
Assuming that the input operands are independently distributed according to the
logarithmic law, what is the probability that no left shift is needed for normalization
of the result?

► 14. [HM30] Let $U$ and $V$ be random, normalized, positive floating point numbers
whose fraction parts are independently distributed according to the logarithmic law,
and let $p_k$ be the probability that the difference in their exponents is $k$. Assuming that
the distribution of the exponents is independent of the fraction parts, give an equation
for the probability that “fraction overflow” occurs during the floating point addition of
$U \oplus V$, in terms of the base $b$ and the quantities $p_0$, $p_1$, $p_2$, .... Compare this result
with exercise 1. (Ignore rounding.)

15. [HM28] Let $U$, $V$, $p_0$, $p_1$, ... be as in exercise 14, and assume that radix 10
arithmetic is being used. Show that regardless of the values of $p_0$, $p_1$, $p_2$, ..., the
sum $U \oplus V$ will not obey the logarithmic law exactly, and in fact the probability that
$U \oplus V$ has leading digit 1 is always strictly less than $\log_{10} 2$.

16. [HM28] (P. Diaconis.) Let $P_0(n)$ be 0 or 1 for each $n$, and define “probabilities”
$P_{m+1}(n)$ by repeated averaging, as in (9). Show that if $\lim_{n\to\infty} P_1(n)$ does not exist,
neither does $\lim_{n\to\infty} P_m(n)$ for any $m$. [Hint: Prove that $a_n \to 0$ whenever we have
$(a_1 + \cdots + a_n)/n \to 0$ and $a_{n+1} \le a_n + M/n$, for some fixed constant $M > 0$.]

► 17. [HM25] (R. L. Duncan.) Another way to define the value of $\Pr(S(n))$ is to
evaluate the quantity $\lim_{n\to\infty}\bigl((\sum_{S(k)\text{ and }1 \le k \le n} 1/k)/H_n\bigr)$; it can be shown that this
“harmonic probability” exists and is equal to $\Pr(S(n))$, whenever the latter exists
according to Definition 3.5A. Prove that the harmonic probability of the statement
“$(\log_{10} n) \bmod 1 < r$” exists and equals $r$. (Thus, initial digits of integers exactly satisfy
the logarithmic law in this sense.)

► 18. [HM30] Let $P(S)$ be any real-valued function defined on sets $S$ of positive integers,
but not necessarily on all such sets, satisfying the following rather weak axioms:

i) If $P(S)$ and $P(T)$ are defined and $S \cap T = \emptyset$, then $P(S \cup T) = P(S) + P(T)$.

ii) If $P(S)$ is defined, then $P(S + 1) = P(S)$, where $S + 1 = \{\, n + 1 \mid n \in S \,\}$.

iii) If $P(S)$ is defined, then $P(2S) = \frac{1}{2}P(S)$, where $2S = \{\, 2n \mid n \in S \,\}$.

iv) If $S$ is the set of all positive integers, then $P(S) = 1$.

v) If $P(S)$ is defined, then $P(S) \ge 0$.

Assume furthermore that $P(L_a)$ is defined for all positive integers $a$, where $L_a$ is the
set of all integers whose decimal representation begins with $a$:

$$L_a = \{\, n \mid 10^m a \le n < 10^m (a + 1) \text{ for some integer } m \,\}.$$

(In this definition, $m$ may be negative; for example, 1 is an element of $L_{10}$, but not of
$L_{11}$.) Prove that $P(L_a) = \log_{10}(1 + 1/a)$ for all integers $a \ge 1$.


4.3. MULTIPLE-PRECISION ARITHMETIC

Let us now consider operations on numbers that have arbitrarily high precision.
For simplicity in exposition, we shall assume that we are working with
integers, instead of with numbers that have an embedded radix point.


4.3.1. The Classical Algorithms 

In this section we shall discuss algorithms for 

a) addition or subtraction of n-place integers, giving an n-place answer and a 
carry; 

b) multiplication of an $n$-place integer by an $m$-place integer, giving an $(m + n)$-place
answer;

c) division of an $(m + n)$-place integer by an $n$-place integer, giving an $(m + 1)$-place
quotient and an $n$-place remainder.

These may be called “the classical algorithms,” since the word “algorithm” was
used only in connection with these processes for several centuries. The term “$n$-place
integer” means any integer less than $b^n$, where $b$ is the radix of ordinary
positional notation in which the numbers are expressed; such numbers can be
written using at most $n$ “places” in this notation.

It is a straightforward matter to apply the classical algorithms for integers 
to numbers with embedded radix points or to extended-precision floating point 
numbers, in the same way that arithmetic operations defined for integers in MIX 
are applied to these more general problems. 

In this section we shall study algorithms that do operations (a), (b), and (c)
above for integers expressed in radix $b$ notation, where $b$ is any given integer
$\ge 2$. Thus the algorithms are quite general definitions of arithmetic processes,
and as such they are unrelated to any particular computer. But the discussion
in this section will also be somewhat machine-oriented, since we are chiefly concerned
with efficient methods for doing high-precision calculations by computer.
Although our examples are based on the mythical MIX, essentially the same 
considerations apply to nearly every other machine. For convenience, we shall 
assume first that we have a computer (like MIX) that uses the signed-magnitude 
representation for numbers; suitable modifications for complement notations are 
discussed near the end of this section. 

The most important fact to understand about extended-precision numbers 
is that they may be regarded as numbers written in radix w notation, where 
$w$ is the computer’s word size. For example, an integer that fills 10 words on
a computer whose word size is $w = 10^{10}$ has 100 decimal digits; but we will
consider it to be a 10-place number to the base $10^{10}$. This viewpoint is justified
for the same reason that we may convert, say, from binary to octal notation, 
simply by grouping the bits together. (See Eq. 4.1-5.) 
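
The grouping idea can be illustrated with a short Python sketch. It is only an example under stated assumptions: the helper name `to_radix_digits` is hypothetical, and Python's built-in unbounded integers stand in for the multiword number. A 100-digit decimal integer is split into places to the base 10^10, and the places turn out to be the decimal digits taken ten at a time.

```python
def to_radix_digits(value, b):
    """Return the radix-b places of a nonnegative integer, most significant first."""
    if value == 0:
        return [0]
    places = []
    while value > 0:
        value, place = divmod(value, b)  # peel off the least significant place
        places.append(place)
    return places[::-1]


if __name__ == "__main__":
    x = int("1234567890" * 10)          # a 100-digit decimal integer
    places = to_radix_digits(x, 10**10)  # ten places to the base 10^10
    print(len(places), places[0], places[-1])
    # Regrouping the decimal digits ten at a time gives back the same number.
    print("".join(f"{p:010d}" for p in places) == str(x))
```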

In these terms, we are given the following primitive operations to work with:

$a_0$) addition or subtraction of one-place integers, giving a one-place answer and
a carry;

$b_0$) multiplication of a one-place integer by another one-place integer, giving a
two-place answer;

$c_0$) division of a two-place integer by a one-place integer, provided that the
quotient is a one-place integer, and yielding also a one-place remainder.

By adjusting the word size, if necessary, nearly all computers will have these three
operations available; so we will construct algorithms (a), (b), and (c) mentioned
above in terms of the primitive operations ($a_0$), ($b_0$), and ($c_0$).

Since we are visualizing extended-precision integers as base $b$ numbers, it is
sometimes helpful to think of the situation when $b = 10$, and to imagine that
we are doing the arithmetic by hand. Then operation ($a_0$) is analogous to memorizing
the addition table; ($b_0$) is analogous to memorizing the multiplication
table; and ($c_0$) is essentially memorizing the multiplication table in reverse.
The more complicated operations (a), (b), (c) on high-precision numbers can 
now be done using the simple addition, subtraction, multiplication, and long- 
division procedures we are taught in elementary school. In fact, most of the 
algorithms we shall discuss in this section are essentially nothing more than 
mechanizations of familiar pencil-and-paper operations. Of course, we must state 
the algorithms much more precisely than they have ever been stated in the fifth 
grade, and we should also attempt to minimize computer memory and running 
time requirements. 

To avoid a tedious discussion and cumbersome notations, let us assume that 
all numbers we deal with are nonnegative. The additional work of computing 
the signs, etc., is quite straightforward, and the reader will find it easy to fill in 
any details of this sort. 

First comes addition, which of course is very simple, but it is worth studying 
since the same ideas occur in the other algorithms also: 

Algorithm A (Addition of nonnegative integers). Given nonnegative $n$-place
integers $(u_1 u_2 \ldots u_n)_b$ and $(v_1 v_2 \ldots v_n)_b$, this algorithm forms their radix-$b$ sum,
$(w_0 w_1 w_2 \ldots w_n)_b$. (Here $w_0$ is the “carry,” and it will always be equal to 0 or 1.)

A1. [Initialize.] Set $j \leftarrow n$, $k \leftarrow 0$. (The variable $j$ will run through the various
digit positions, and the variable $k$ keeps track of carries at each step.)

A2. [Add digits.] Set $w_j \leftarrow (u_j + v_j + k) \bmod b$, and $k \leftarrow \lfloor (u_j + v_j + k)/b \rfloor$.
(In other words, $k$ is set to 1 or 0, depending on whether a “carry” occurs
or not, i.e., whether $u_j + v_j + k \ge b$ or not. At most one carry is possible
during the two additions, since we always have

$$u_j + v_j + k \le (b - 1) + (b - 1) + 1 < 2b,$$

by induction on the computation.)

A3. [Loop on j.] Decrease $j$ by one. Now if $j > 0$, go back to step A2; otherwise
set $w_0 \leftarrow k$ and terminate the algorithm. |
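
Steps A1 to A3 can be transcribed almost literally into Python. The following is a sketch for illustration, not the MIX implementation; the function name `add_digits` and the convention of passing digits as lists with the most significant place first are assumptions made here.

```python
def add_digits(u, v, b):
    """Algorithm A: return [w_0, w_1, ..., w_n], the radix-b sum of two n-place integers.

    u and v are lists [u_1, ..., u_n] and [v_1, ..., v_n], most significant place first.
    """
    n = len(u)
    if len(v) != n:
        raise ValueError("both operands must have the same number of places")
    w = [0] * (n + 1)
    k = 0                              # A1: the carry starts at zero
    for j in range(n, 0, -1):          # j = n, n-1, ..., 1
        t = u[j - 1] + v[j - 1] + k    # A2: t is always less than 2b
        w[j] = t % b
        k = t // b                     # k is 0 or 1
    w[0] = k                           # A3: the final carry becomes w_0
    return w


if __name__ == "__main__":
    print(add_digits([9, 1, 4], [0, 8, 6], 10))   # 914 + 086
```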

For a formal proof that Algorithm A is valid, see exercise 4.

A MIX program for this addition process might take the following form:

Program A (Addition of nonnegative integers). Let LOC($u_j$) = U + j, LOC($v_j$) =
V + j, LOC($w_j$) = W + j, rI1 ≡ j, rA ≡ k, word size ≡ b, N ≡ n.

    01        ENT1  N       1          A1. Initialize. j ← n.
    02        JOV   OFLO    1          Ensure overflow is off.
    03  1H    ENTA  0       N+1−K      k ← 0.
    04        J1Z   3F      N+1−K      To A3 if j = 0.
    05  2H    ADD   U,1     N          A2. Add digits.
    06        ADD   V,1     N
    07        STA   W,1     N
    08        DEC1  1       N          A3. Loop on j.
    09        JNOV  1B      N          If no overflow, set k ← 0.
    10        ENTA  1       K          Otherwise, set k ← 1.
    11        J1P   2B      K          To A2 if j ≠ 0.
    12  3H    STA   W       1          Store final carry in w_0. |

The running time for this program is 10N + 6 cycles, independent of the number
of carries, K. The quantity K is analyzed in detail at the close of this section.

Many modifications of Algorithm A are possible, and only a few of these are 
mentioned in the exercises below. A chapter on generalizations of this algorithm 
might be entitled “How to design addition circuits for a digital computer.” 

The problem of subtraction is similar to addition, but the differences are 
worth noting: 

Algorithm S (Subtraction of nonnegative integers). Given nonnegative $n$-place
integers $(u_1 u_2 \ldots u_n)_b \ge (v_1 v_2 \ldots v_n)_b$, this algorithm forms their nonnegative
radix-$b$ difference, $(w_1 w_2 \ldots w_n)_b$.

S1. [Initialize.] Set $j \leftarrow n$, $k \leftarrow 0$.

S2. [Subtract digits.] Set $w_j \leftarrow (u_j - v_j + k) \bmod b$, and $k \leftarrow \lfloor (u_j - v_j + k)/b \rfloor$.
(In other words, $k$ is set to $-1$ or 0, depending on whether a “borrow”
occurs or not, i.e., whether $u_j - v_j + k < 0$ or not. In the calculation of
$w_j$, note that we must have $-b = 0 - (b - 1) + (-1) \le u_j - v_j + k \le
(b - 1) - 0 + 0 < b$; hence $0 \le u_j - v_j + k + b < 2b$, and this suggests
the method of computer implementation explained below.)

S3. [Loop on j.] Decrease $j$ by one. Now if $j > 0$, go back to step S2; otherwise
terminate the algorithm. (When the algorithm terminates, we should have
$k = 0$; the condition $k = -1$ will occur if and only if $(v_1 \ldots v_n)_b > (u_1 \ldots u_n)_b$,
and this is contrary to the given assumptions. See exercise 12.) |
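
A direct Python transcription of steps S1 to S3 may help in checking the borrow logic. This is an illustrative sketch; the name `subtract_digits` and the list convention (most significant place first) are assumptions. It relies on the fact that Python's `%` and `//` operators already compute the mathematical mod and floor for negative operands, which is exactly what step S2 requires.

```python
def subtract_digits(u, v, b):
    """Algorithm S: return [w_1, ..., w_n], the radix-b difference u - v, assuming u >= v."""
    n = len(u)
    if len(v) != n:
        raise ValueError("both operands must have the same number of places")
    w = [0] * n
    k = 0                              # S1: no borrow yet
    for j in range(n, 0, -1):          # j = n, n-1, ..., 1
        t = u[j - 1] - v[j - 1] + k    # S2: -b <= t < b
        w[j - 1] = t % b               # nonnegative even when t is negative
        k = t // b                     # -1 if a borrow occurred, otherwise 0
    if k != 0:                         # S3: a leftover borrow means v > u
        raise ValueError("subtrahend exceeds minuend")
    return w


if __name__ == "__main__":
    print(subtract_digits([1, 0, 0], [0, 0, 1], 10))   # 100 - 001
```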

In a MIX program to implement subtraction, it is most convenient to retain
the value $1 + k$ instead of $k$ throughout the algorithm, so that we can calculate
$u_j - v_j + (1 + k) + (b - 1)$ in step S2. (Recall that $b$ is the word size.) This
is illustrated in the following code.

Program S (Subtraction of nonnegative integers). This program is analogous
to the code in Program A; we have rI1 ≡ j, rA ≡ 1 + k. Here, as in other
programs of this section, location WM1 contains the constant $b - 1$, the largest
possible value that can be stored in a MIX word; cf. Program 4.2.3D, lines 38-39.

    01        ENT1  N       1        S1. Initialize. j ← n.
    02        JOV   OFLO    1        Ensure overflow is off.
    03  1H    J1Z   DONE    K+1      Terminate if j = 0.
    04        ENTA  1       K        Set k ← 0.
    05  2H    ADD   U,1     N        S2. Subtract digits.
    06        SUB   V,1     N        Compute u_j − v_j + k + b.
    07        ADD   WM1     N
    08        STA   W,1     N        (May be minus zero)
    09        DEC1  1       N        S3. Loop on j.
    10        JOV   1B      N        If overflow, set k ← 0.
    11        ENTA  0       N−K      Otherwise, set k ← −1.
    12        J1P   2B      N−K      Back to S2.
    13        HLT   5                (Error, v > u) |

The running time for this program is 12N + 3 cycles, slightly longer than the
corresponding amount for Program A.

The reader may wonder if it would not be worthwhile to have a combined 
addition-subtraction routine in place of the two algorithms A and S. But an 
examination of the computer programs shows that it is generally better to use two 
different routines, so that the inner loops of the computations can be performed 
as rapidly as possible, since the programs are so short. 

Our next problem is multiplication, and here we carry the ideas used in 
Algorithm A a little further: 

Algorithm M (Multiplication of nonnegative integers). Given nonnegative integers
$(u_1 u_2 \ldots u_n)_b$ and $(v_1 v_2 \ldots v_m)_b$, this algorithm forms their radix-$b$ product
$(w_1 w_2 \ldots w_{m+n})_b$. (The conventional pencil-and-paper method is based on forming
the partial products $(u_1 u_2 \ldots u_n) \times v_j$ first, for $1 \le j \le m$, and then adding
these products together with appropriate scale factors; but in a computer it is
best to do the addition concurrently with the multiplication, as described in this
algorithm.)

M1. [Initialize.] Set $w_{m+1}$, $w_{m+2}$, ..., $w_{m+n}$ all to zero. Set $j \leftarrow m$. (If $w_{m+1}$,
..., $w_{m+n}$ were not cleared to zero in this step, it turns out that the steps
below would set

$$(w_1 \ldots w_{m+n})_b \leftarrow (u_1 \ldots u_n)_b \times (v_1 \ldots v_m)_b + (w_{m+1} \ldots w_{m+n})_b.$$

This more general operation is sometimes useful.)

M2. [Zero multiplier?] If $v_j = 0$, set $w_j \leftarrow 0$ and go to step M6. (This test
saves a good deal of time if there is a reasonable chance that $v_j$ is zero, but
otherwise it may be omitted without affecting the validity of the algorithm.)

M3. [Initialize i.] Set $i \leftarrow n$, $k \leftarrow 0$.

M4. [Multiply and add.] Set $t \leftarrow u_i \times v_j + w_{i+j} + k$; then set $w_{i+j} \leftarrow t \bmod b$
and $k \leftarrow \lfloor t/b \rfloor$. (Here the “carry” $k$ will always be in the range $0 \le k < b$;
see below.)

M5. [Loop on i.] Decrease $i$ by one. Now if $i > 0$, go back to step M4; otherwise
set $w_j \leftarrow k$.

M6. [Loop on j.] Decrease $j$ by one. Now if $j > 0$, go back to step M2; otherwise
the algorithm terminates. |

Table 1

MULTIPLICATION OF 914 BY 84.

    Step   i   j   u_i   v_j    t    w_1   w_2   w_3   w_4   w_5
    M5     3   2    4     4    16     x     x     0     0     6
    M5     2   2    1     4    05     x     x     0     5     6
    M5     1   2    9     4    36     x     x     6     5     6
    M6     0   2    x     4    36     x     3     6     5     6
    M5     3   1    4     8    37     x     3     6     7     6
    M5     2   1    1     8    17     x     3     7     7     6
    M5     1   1    9     8    76     x     6     7     7     6
    M6     0   1    x     8    76     7     6     7     7     6

Algorithm M is illustrated in Table 1, assuming that $b = 10$, by showing
the states of the computation at the beginning of steps M5 and M6. A proof of
Algorithm M appears in the answer to exercise 14.

The two inequalities

$$0 \le t < b^2, \qquad 0 \le k < b \tag{1}$$

are crucial for an efficient implementation of this algorithm, since they point out
how large a register is needed for the computations. These inequalities may be
proved by induction as the algorithm proceeds, for if we have $k < b$ at the start
of step M4, we have

$$u_i \times v_j + w_{i+j} + k \le (b - 1) \times (b - 1) + (b - 1) + (b - 1) = b^2 - 1 < b^2.$$
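
Algorithm M and the bounds (1) can be exercised together in a short Python sketch. This is an illustration only, not the MIX program; the name `multiply_digits`, the list convention (most significant place first), and the use of `assert` to check the inequalities at run time are assumptions made for the example.

```python
def multiply_digits(u, v, b):
    """Algorithm M: return [w_1, ..., w_{m+n}], the radix-b product of u (n places) and v (m places)."""
    n, m = len(u), len(v)
    w = [0] * (m + n + 1)              # w[1..m+n]; w[0] is unused so indices match the text
    for j in range(m, 0, -1):          # M1 sets j = m; M6 decreases j
        if v[j - 1] == 0:              # M2: a zero multiplier digit contributes nothing
            w[j] = 0
            continue
        k = 0                          # M3
        for i in range(n, 0, -1):
            t = u[i - 1] * v[j - 1] + w[i + j] + k   # M4
            assert 0 <= t < b * b      # first inequality of (1)
            w[i + j] = t % b
            k = t // b
            assert 0 <= k < b          # second inequality of (1)
        w[j] = k                       # M5: store the final carry of this row
    return w[1:]


if __name__ == "__main__":
    print(multiply_digits([9, 1, 4], [8, 4], 10))   # 914 x 84, as in Table 1
```

Running the example reproduces the final row of Table 1, and the assertions never fire, which is consistent with the induction argument above.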

The following MIX program shows the considerations that are necessary when 
Algorithm M is implemented on a computer. The coding for step M4 would be a 
little simpler if our computer had a “multiply-and-add” instruction, or if it had 
a double-length accumulator for addition. 

Program M. (Multiplication of nonnegative integers). This program is analogous
to Program A. rI1 ≡ i, rI2 ≡ j, rI3 ≡ i + j, CONTENTS(CARRY) ≡ k.

    01        ENT1  N        1          M1. Initialize.
    02        JOV   OFLO     1          Ensure overflow is off.
    03        STZ   W+M,1    N          w_{m+i} ← 0.
    04        DEC1  1        N
    05        J1P   *-2      N          Repeat for n ≥ i > 0.
    06        ENT2  M        1          j ← m.
    07  1H    LDX   V,2      M          M2. Zero multiplier?
    08        JXZ   8F       M          If v_j = 0, set w_j ← 0 and go to M6.
    09        ENT1  N        M−Z        M3. Initialize i.
    10        ENT3  N,2      M−Z        i ← n, (i + j) ← (n + j).
    11        ENTX  0        M−Z        k ← 0.
    12  2H    STX   CARRY    (M−Z)N     M4. Multiply and add.
    13        LDA   U,1      (M−Z)N
    14        MUL   V,2      (M−Z)N     rAX ← u_i × v_j.
    15        SLC   5        (M−Z)N     Interchange rA ↔ rX.
    16        ADD   W,3      (M−Z)N     Add w_{i+j} to lower half.
    17        JNOV  *+2      (M−Z)N     Did overflow occur?
    18        INCX  1        K          If so, carry 1 into upper half.
    19        ADD   CARRY    (M−Z)N     Add k to lower half.
    20        JNOV  *+2      (M−Z)N     Did overflow occur?
    21        INCX  1        K′         If so, carry 1 into upper half.
    22        STA   W,3      (M−Z)N     w_{i+j} ← t mod b.
    23        DEC1  1        (M−Z)N     M5. Loop on i.
    24        DEC3  1        (M−Z)N     Decrease i and (i + j) by 1.
    25        J1P   2B       (M−Z)N     Back to M4 if i > 0; rX = ⌊t/b⌋.
    26  8H    STX   W,2      M          Set w_j ← k.
    27        DEC2  1        M          M6. Loop on j.
    28        J2P   1B       M          Repeat until j = 0. |

The execution time of Program M depends on the number of places, M, in
the multiplier v; the number of places, N, in the multiplicand u; the number
of zeros, Z, in the multiplier; and the number of carries, K and K′, that occur
during the addition to the lower half of the product in the computation of t. If we
approximate both K and K′ by the reasonable (although somewhat pessimistic)
values $\frac12(M - Z)N$, we find that the total running time comes to
$28MN + 10M + 4N + 3 - Z(28N + 3)$ cycles. If step M2 were deleted, the running time
would be $28MN + 7M + 4N + 3$ cycles, so this step is not advantageous unless
the density of zero positions within the multiplier is $Z/M > 3/(28N + 3)$. If
the multiplier is chosen completely at random, this ratio Z/M is expected to be
only about 1/b, which is extremely small; so step M2 is usually not worthwhile,
unless b is small.

Algorithm M is not the fastest way to multiply when m and n are large, 
although it has the advantage of simplicity. Speedier methods are discussed in 
Section 4.3.3; even when m = n = 4, it is possible to multiply numbers in a 
little less time than is required by Algorithm M. 
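
As an illustration of the inequalities (1), here is a minimal Python sketch of the classical multiplication loop that Program M implements. It is not a transcription of the MIX code: the function name `multiply`, the helper `from_digits`, and the use of Python lists (most significant digit first, indexed from 0) are assumptions made only for this example. The assertions check that t always fits in a double-length register and that the carry k is always a single digit.

```python
def from_digits(digits, b=10):
    """Hypothetical helper: value of a digit list, most significant digit first."""
    value = 0
    for d in digits:
        value = value * b + d
    return value


def multiply(u, v, b=10):
    """Classical multiplication of digit lists u (n places) and v (m places)."""
    n, m = len(u), len(v)
    w = [0] * (m + n)                      # product digits, initially zero
    for j in range(m - 1, -1, -1):         # multiplier digits, right to left
        if v[j] == 0:                      # step M2: w[j] is still zero, skip
            continue
        k = 0                              # step M3: carry starts at zero
        for i in range(n - 1, -1, -1):     # steps M4 and M5
            t = u[i] * v[j] + w[i + j + 1] + k
            assert 0 <= t < b * b          # first inequality of (1)
            w[i + j + 1] = t % b
            k = t // b
            assert 0 <= k < b              # second inequality of (1)
        w[j] = k
    return w


product = multiply([3, 1, 4, 2], [4, 7])   # 3142 times 47
```

The zero-digit shortcut corresponds to step M2; as the running-time analysis above suggests, it can be dropped without affecting the result.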

The final algorithm of concern to us in this section is long division, in which 
we want to divide (n + m)-place integers by n-place integers. Here the ordinary 
pencil-and-paper method involves a certain amount of guesswork and ingenuity 
on the part of the person doing the division; we must either eliminate this
guesswork from the algorithm or develop some theory to explain it more carefully.

A moment’s reflection about the ordinary process of long division shows that
the general problem breaks down into simpler steps, each of which is the division
of an (n + 1)-place number u by the n-place divisor v, where $0 \le u/v < b$; the
remainder r after each step is less than v, so we may use the quantity rb + (next
place of dividend) as the new u in the succeeding step. For example, if we are
asked to divide 3142 by 47, we first divide 314 by 47, getting 6 and a remainder
of 32; then we divide 322 by 47, getting 6 and a remainder of 40; thus we have
a quotient of 66 and a remainder of 40. It is clear that this same idea works in
general, and so our search for an appropriate division algorithm reduces to the
following problem (Fig. 6):

[Fig. 6. Wanted: a way to determine q rapidly. The figure shows the divisor
v1 v2 ... vn, the dividend u0 u1 u2 ... un, the quotient digit q above the
dividend, the product qv subtracted from it, and the remainder r below.]

Let $u = (u_0 u_1 \ldots u_n)_b$ and $v = (v_1 v_2 \ldots v_n)_b$ be nonnegative integers in
radix-b notation, such that $u/v < b$. Find an algorithm to determine
$q = \lfloor u/v \rfloor$.


We may observe that the condition $u/v < b$ is equivalent to the condition that
$u/b < v$; i.e., $\lfloor u/b \rfloor < v$. This is simply the condition that
$(u_0 u_1 \ldots u_{n-1})_b < (v_1 v_2 \ldots v_n)_b$. Furthermore, if we write $r = u - qv$, then q is
the unique integer such that $0 \le r < v$.

The most obvious approach to this problem is to make a guess about q,
based on the most significant digits of u and v. It isn’t obvious that such a
method will be reliable enough, but it is worth investigating; let us therefore set

$$\hat q = \min\left(\left\lfloor \frac{u_0 b + u_1}{v_1} \right\rfloor,\; b - 1\right). \tag{2}$$

This formula says $\hat q$ is obtained by dividing the two leading digits of u by the
leading digit of v; and if the result is b or more we can replace it by $(b - 1)$.

It is a remarkable fact, which we will now investigate, that this value $\hat q$
is always a very good approximation to the desired answer q, so long as $v_1$ is
reasonably large. In order to analyze how close $\hat q$ comes to q, we will first prove
that $\hat q$ is never too small.


Theorem A. In the notation above, $\hat q \ge q$.

Proof. Since $q \le b - 1$, the theorem is certainly true if $\hat q = b - 1$. Otherwise
we have $\hat q = \lfloor (u_0 b + u_1)/v_1 \rfloor$, hence $\hat q v_1 \ge u_0 b + u_1 - v_1 + 1$. It follows that

$$\begin{aligned}
u - \hat q v \le u - \hat q v_1 b^{n-1} &\le u_0 b^n + \cdots + u_n - (u_0 b^n + u_1 b^{n-1} - v_1 b^{n-1} + b^{n-1}) \\
&= u_2 b^{n-2} + \cdots + u_n - b^{n-1} + v_1 b^{n-1} < v_1 b^{n-1} \le v.
\end{aligned}$$

Since $u - \hat q v < v$, we must have $\hat q \ge q$. ∎

We will now prove that $\hat q$ cannot be much larger than q in practical situations.
Assume that $\hat q \ge q + 3$. We have

$$\hat q \le \frac{u_0 b + u_1}{v_1} = \frac{u_0 b^n + u_1 b^{n-1}}{v_1 b^{n-1}} \le \frac{u}{v_1 b^{n-1}} < \frac{u}{v - b^{n-1}}.$$

(The case $v = b^{n-1}$ is impossible, for if $v = (100\ldots0)_b$ then $\hat q = q$.) Furthermore,
the relation $q > (u/v) - 1$ implies that

$$3 \le \hat q - q < \frac{u}{v - b^{n-1}} - \frac{u}{v} + 1 = \frac{u}{v}\left(\frac{b^{n-1}}{v - b^{n-1}}\right) + 1.$$

Therefore

$$\frac{u}{v} > 2\left(\frac{v - b^{n-1}}{b^{n-1}}\right) \ge 2(v_1 - 1).$$

Finally, since $b - 4 \ge \hat q - 3 \ge q = \lfloor u/v \rfloor \ge 2(v_1 - 1)$, we have $v_1 < \lfloor b/2 \rfloor$.
This proves the result we seek:


Theorem B. If $v_1 \ge \lfloor b/2 \rfloor$, then $\hat q - 2 \le q \le \hat q$. ∎

The most important part of this theorem is that the conclusion is independent
of b; no matter how large b is, the trial quotient $\hat q$ will never be more than
2 in error.

The condition that $v_1 \ge \lfloor b/2 \rfloor$ is very much like a normalization requirement;
in fact, it is exactly the condition of floating-binary normalization in a
binary computer. One simple way to ensure that $v_1$ is sufficiently large is to
multiply both u and v by $\lfloor b/(v_1 + 1) \rfloor$; this does not change the value of u/v,
nor does it increase the number of places in v, and exercise 23 proves that it will
always make the new value of $v_1$ large enough. (Note: Another way to normalize
the divisor is discussed in exercise 28.)
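
Theorems A and B, and the effect of the normalizing multiplier, can be checked by brute force for a small radix. The following Python sketch is an illustration only; the function names `trial_quotient`, `worst_overestimate`, and `normalize` are hypothetical, and u and v are held as ordinary integers rather than digit strings. It examines every n-place divisor v and every u with $u/v < b$.

```python
def trial_quotient(u, v, b, n):
    """The estimate (2): two leading digits of u divided by the leading digit of v."""
    u0, u1 = u // b**n, (u // b**(n - 1)) % b
    v1 = v // b**(n - 1)
    return min((u0 * b + u1) // v1, b - 1)


def worst_overestimate(b, n, normalized_only):
    """Largest value of (trial quotient - true quotient) over all admissible u, v."""
    worst = 0
    for v in range(b**(n - 1), b**n):          # n-place divisors with v1 != 0
        if normalized_only and v // b**(n - 1) < b // 2:
            continue                           # keep only v1 >= floor(b/2)
        for u in range(v * b):                 # all u with u/v < b
            q_hat = trial_quotient(u, v, b, n)
            assert q_hat >= u // v             # Theorem A
            worst = max(worst, q_hat - u // v)
    return worst


def normalize(v, b, n):
    """Multiply v by d = floor(b/(v1 + 1)); returns (d, d*v)."""
    d = b // (v // b**(n - 1) + 1)
    return d, d * v


bound_normalized = worst_overestimate(10, 2, normalized_only=True)
bound_unrestricted = worst_overestimate(10, 2, normalized_only=False)
```

With b = 10 and n = 2 the normalized bound should not exceed 2, in agreement with Theorem B, while unnormalized divisors such as v = 19 with u = 100 give a trial quotient of 9 against a true quotient of 5.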

Now that we have armed ourselves with all of these facts, we are in a position
to write the desired long-division algorithm. This algorithm uses a slightly
improved choice of $\hat q$ in step D3, which guarantees that $q = \hat q$ or $\hat q - 1$; in
fact, the improved choice of $\hat q$ made here is almost always accurate.


Algorithm D (Division of nonnegative integers). Given nonnegative integers
$u = (u_1 u_2 \ldots u_{m+n})_b$ and $v = (v_1 v_2 \ldots v_n)_b$, where $v_1 \ne 0$ and $n > 1$, we
form the radix-b quotient $\lfloor u/v \rfloor = (q_0 q_1 \ldots q_m)_b$ and the remainder
$u \bmod v = (r_1 r_2 \ldots r_n)_b$. (This notation is slightly different from that used in the above
proofs. When n = 1, the simpler algorithm of exercise 16 should be used.)


D1. [Normalize.] Set $d \leftarrow \lfloor b/(v_1 + 1) \rfloor$. Then set $(u_0 u_1 u_2 \ldots u_{m+n})_b$ equal to
$(u_1 u_2 \ldots u_{m+n})_b$ times d, and set $(v_1 v_2 \ldots v_n)_b$ equal to $(v_1 v_2 \ldots v_n)_b$ times d.
(Note the introduction of a new digit position $u_0$ at the left of $u_1$; if d = 1,
all we need to do in this step is to set $u_0 \leftarrow 0$. On a binary computer it
may be preferable to choose d to be a power of 2 instead of using the value
suggested here; any value of d that results in $v_1 \ge \lfloor b/2 \rfloor$ will suffice. See
also exercise 37.)

[Fig. 7. Long division.]


D2. [Initialize j.] Set $j \leftarrow 0$. (The loop on j, steps D2 through D7, will be
essentially a division of $(u_j u_{j+1} \ldots u_{j+n})_b$ by $(v_1 v_2 \ldots v_n)_b$ to get a single
quotient digit $q_j$; cf. Fig. 6.)

D3. [Calculate $\hat q$.] If $u_j = v_1$, set $\hat q \leftarrow b - 1$; otherwise set $\hat q \leftarrow \lfloor (u_j b + u_{j+1})/v_1 \rfloor$.
Now test if $v_2 \hat q > (u_j b + u_{j+1} - \hat q v_1) b + u_{j+2}$; if so, decrease $\hat q$ by 1 and
repeat this test. (The latter test determines at high speed most of the cases
in which the trial value $\hat q$ is one too large, and it eliminates all cases where
$\hat q$ is two too large; see exercises 19, 20, 21.)

D4. [Multiply and subtract.] Replace $(u_j u_{j+1} \ldots u_{j+n})_b$ by $(u_j u_{j+1} \ldots u_{j+n})_b$
minus $\hat q$ times $(v_1 v_2 \ldots v_n)_b$. This step (analogous to steps M3, M4, and M5
of Algorithm M) consists of a simple multiplication by a one-place number,
combined with a subtraction. The digits $(u_j, u_{j+1}, \ldots, u_{j+n})$ should be kept
positive; if the result of this step is actually negative, $(u_j u_{j+1} \ldots u_{j+n})_b$
should be left as the true value plus $b^{n+1}$, i.e., as the b’s complement of the
true value, and a “borrow” to the left should be remembered.

D5. [Test remainder.] Set $q_j \leftarrow \hat q$. If the result of step D4 was negative, go to
step D6; otherwise go on to step D7.

D6. [Add back.] (The probability that this step is necessary is very small, on the
order of only 2/b, as shown in exercise 21; test data that activates this step
should therefore be specifically contrived when debugging.) Decrease $q_j$ by
1, and add $(0 v_1 v_2 \ldots v_n)_b$ to $(u_j u_{j+1} u_{j+2} \ldots u_{j+n})_b$. (A carry will occur to
the left of $u_j$, and it should be ignored since it cancels with the “borrow”
that occurred in D4.)

D7. [Loop on j.] Increase j by one. Now if $j \le m$, go back to D3.

D8. [Unnormalize.] Now $(q_0 q_1 \ldots q_m)_b$ is the desired quotient, and the desired
remainder may be obtained by dividing $(u_{m+1} \ldots u_{m+n})_b$ by d. ∎
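
Steps D1 through D8 can be followed in a high-level language as well. The Python sketch below is one possible rendering, not the MIX program of this section: digit strings are Python lists with the most significant digit first and 0-origin indexing (so `v[0]` plays the role of $v_1$), and the names `mul_small`, `div_small`, and `divide` are assumptions introduced for this example. Python's floor division is used to keep the digits nonnegative in step D4.

```python
def mul_small(digits, d, b):
    """Multiply a digit list by a single digit; the result has one extra leading digit."""
    out, carry = [], 0
    for x in reversed(digits):
        carry, low = divmod(x * d + carry, b)
        out.append(low)
    out.append(carry)
    return out[::-1]


def div_small(digits, d, b):
    """Divide a digit list by a single digit d, discarding the remainder."""
    out, r = [], 0
    for x in digits:
        digit, r = divmod(r * b + x, d)
        out.append(digit)
    return out


def divide(u, v, b=10):
    """Long division of u (m+n places) by v (n places); returns (quotient, remainder)."""
    n, m = len(v), len(u) - len(v)
    assert n > 1 and m >= 0 and v[0] != 0
    d = b // (v[0] + 1)                        # D1. Normalize.
    u = mul_small(u, d, b)                     # u[0] is the new leading position
    v = mul_small(v, d, b)[1:]                 # number of places in v is unchanged
    q = [0] * (m + 1)
    for j in range(m + 1):                     # D2 and D7: loop on j
        if u[j] == v[0]:                       # D3. Calculate the trial quotient.
            q_hat = b - 1
        else:
            q_hat = (u[j] * b + u[j + 1]) // v[0]
        while v[1] * q_hat > (u[j] * b + u[j + 1] - q_hat * v[0]) * b + u[j + 2]:
            q_hat -= 1
        borrow = 0                             # D4. Multiply and subtract.
        for i in range(n, 0, -1):
            t = u[j + i] - q_hat * v[i - 1] - borrow
            u[j + i] = t % b                   # digit kept nonnegative
            borrow = -(t // b)                 # floor division gives the borrow
        t = u[j] - borrow
        u[j] = t % b
        q[j] = q_hat                           # D5. Test remainder.
        if t < 0:                              # D6. Add back (rarely needed).
            q[j] -= 1
            carry = 0
            for i in range(n, 0, -1):
                carry, u[j + i] = divmod(u[j + i] + v[i - 1] + carry, b)
            u[j] = (u[j] + carry) % b          # final carry cancels the borrow
    return q, div_small(u[m + 1:], d, b)       # D8. Unnormalize.


quotient, remainder = divide([3, 1, 4, 2], [4, 7])
```

For 3142 divided by 47 the normalizing factor is d = 2, so the loop actually divides 6284 by 94 and the final remainder 80 is divided by 2 to give 40.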

The representation of Algorithm D as a MIX program has several points of
interest:

Program D (Division of nonnegative integers). The conventions of this program
are analogous to Program A; rI1 ≡ i, rI2 ≡ j − m, rI3 ≡ i + j.

001  D1   JOV   OFLO          1                D1. Normalize.
          ...                                  (See exercise 25)
039  D2   ENN2  M             1                D2. Initialize j.
040       STZ   V             1                Set v_0 ← 0, for convenience in D4.
041  D3   LDA   U+M,2(1:5)    M + 1            D3. Calculate q̂.
042       LDX   U+M+1,2       M + 1            rAX ← u_j b + u_{j+1}.
043       DIV   V+1           M + 1            rA ← ⌊rAX/v_1⌋.
044       JOV   1F            M + 1            Jump if quotient = b.
045       STA   QHAT          M + 1            q̂ ← rA.
046       STX   RHAT          M + 1            r̂ ← u_j b + u_{j+1} − q̂ v_1
047       JMP   2F            M + 1               = (u_j b + u_{j+1}) mod v_1.
048  1H   LDX   WM1                            rX ← b − 1.
049       LDA   U+M+1,2                        rA ← u_{j+1}. (Here u_j = v_1.)
050       JMP   4F
051  3H   LDX   QHAT          E
052       DECX  1             E                Decrease q̂ by one.
053       LDA   RHAT          E                Adjust r̂ accordingly:
054  4H   STX   QHAT          E                q̂ ← rX.
055       ADD   V+1           E                rA ← rA + v_1.
056       JOV   D4            E                (If r̂ will be ≥ b, v_2 q̂ will be < r̂ b.)
057       STA   RHAT          E                r̂ ← rA.
058       LDA   QHAT          E
059  2H   MUL   V+2           M + E + 1
060       CMPA  RHAT          M + E + 1        Test if v_2 q̂ ≤ r̂ b + u_{j+2}.
061       JL    D4            M + E + 1
062       JG    3B            E
063       CMPX  U+M+2,2
064       JG    3B                             If not, q̂ is too large.
065  D4   ENTX  1             M + 1            D4. Multiply and subtract.
066       ENT1  N             M + 1            i ← n.
067       ENT3  M+N,2         M + 1            (i + j) ← (n + j).
068  2H   STX   CARRY         (M + 1)(N + 1)   (Here 1 − b < rX ≤ +1.)
069       LDAN  V,1           (M + 1)(N + 1)
070       MUL   QHAT          (M + 1)(N + 1)   rAX ← −q̂ v_i.
071       SLC   5             (M + 1)(N + 1)   Interchange rA ↔ rX.
072       ADD   CARRY         (M + 1)(N + 1)   Add the contribution from the
073       JNOV  *+2           (M + 1)(N + 1)      digit to the right, plus 1.
074       DECX  1             K                If sum is ≤ −b, carry −1.
075       ADD   U,3           (M + 1)(N + 1)   Add u_{i+j}.
076       ADD   WM1           (M + 1)(N + 1)   Add b − 1 to force + sign.
077       JNOV  *+2           (M + 1)(N + 1)   If no overflow, carry −1.
078       INCX  1             K′               rX ≡ carry + 1.
079       STA   U,3           (M + 1)(N + 1)   u_{i+j} ← rA (may be minus zero).
080       DEC1  1             (M + 1)(N + 1)
081       DEC3  1             (M + 1)(N + 1)
082       J1NN  2B            (M + 1)(N + 1)   Repeat for n ≥ i ≥ 0.
083  D5   LDA   QHAT          M + 1            D5. Test remainder.
084       STA   Q+M,2         M + 1            Set q_j ← q̂.
085       JXP   D7            M + 1            (Here rX = 0 or 1, since v_0 = 0.)
086  D6   DECA  1                              D6. Add back.
087       STA   Q+M,2                          Set q_j ← q̂ − 1.
088       ENT1  N                              i ← n.
089       ENT3  M+N,2                          (i + j) ← (n + j).
090  1H   ENTA  0                              (This is essentially Program A.)
091  2H   ADD   U,3
092       ADD   V,1
093       STA   U,3
094       DEC1  1
095       DEC3  1
096       JNOV  1B
097       ENTA  1
098       J1NN  2B
099  D7   INC2  1             M + 1            D7. Loop on j.
100       J2NP  D3            M + 1            Repeat for 0 ≤ j ≤ m.
101  D8   ...                                  (See exercise 26) ∎



Note how easily the rather complex-appearing calculations and decisions of 
step D3 can be handled inside the machine. Note also that the program for 
step D4 is analogous to Program M, except that the ideas of Program S have 
also been incorporated. 

The running time for Program D can be estimated by considering the quantities
M, N, E, K, and K′ shown in the program. (These quantities ignore
several situations that occur only with very low probability; for example, we
may assume that lines 048-050, 063-064, and step D6 are never executed.) Here
M + 1 is the number of words in the quotient; N is the number of words in
the divisor; E is the number of times $\hat q$ is adjusted downwards in step D3; K
and K′ are the number of times certain “carry” adjustments are made during
the multiply-subtract loop. If we assume that K + K′ is approximately
(N + 1)(M + 1), and that E is approximately $\frac12 M$, we get a total running time of
approximately

$$30MN + 30N + 89M + 111$$

cycles, plus $67N + 235M + 4$ more if d > 1. (The program segments of exercises
25 and 26 are included in these totals.) When M and N are large, this is only
about seven percent longer than the time Program M takes to multiply the
quotient by the divisor.

Further commentary on Algorithm D appears in the exercises at the close 
of this section. 

It is possible to debug programs for multiple-precision arithmetic by using 
the multiplication and addition routines to check the result of the division 
routine, etc. The following type of test data is occasionally useful: 


$$(t^m - 1)(t^n - 1) = t^{m+n} - t^n - t^m + 1.$$

If m < n, this number has the radix-t expansion

$$\underbrace{(t-1)\ \ldots\ (t-1)}_{m-1\ \text{places}}\ (t-2)\ \underbrace{(t-1)\ \ldots\ (t-1)}_{n-m\ \text{places}}\ \underbrace{0\ \ldots\ 0}_{m-1\ \text{places}}\ 1;$$

for example, $(10^3 - 1)(10^5 - 1) = 99899001$. In the case of Program D, it is
also necessary to find some test cases that cause the rarely executed parts of the 
program to be exercised; some portions of that program would probably never 
get tested even if a million random test cases were tried. 
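
A small Python sketch can generate this kind of test data directly from the digit pattern and confirm it against the closed form. The function name `product_pattern` is hypothetical; digits are listed most significant first.

```python
def product_pattern(t, m, n):
    """Radix-t digits of (t^m - 1)(t^n - 1) for 0 < m < n, built from the pattern."""
    assert t >= 2 and 0 < m < n
    return ([t - 1] * (m - 1) + [t - 2] + [t - 1] * (n - m)
            + [0] * (m - 1) + [1])


def digits_to_int(digits, t):
    """Value of a digit list in radix t."""
    value = 0
    for d in digits:
        value = value * t + d
    return value


example = product_pattern(10, 3, 5)
```

Feeding such a product and one of its factors to a division routine should return the other factor with remainder zero, which makes the expected answer easy to state in advance.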

Now that we have seen how to operate with signed-magnitude numbers, 
let us consider what approach should be taken to the same problems when a 
computer with complement notation is being used. For two’s complement and 
ones’ complement notations, it is usually best to let the radix b be one half 
of the word size; thus for a 32-bit computer word we would use $b = 2^{31}$ in
the above algorithms. The sign bit of all but the most significant word of a 
multiple-precision number will be zero, so that no anomalous sign correction 
takes place during the computer’s multiplication and division operations. In 
fact, the basic meaning of complement notation requires that we consider all but 
the most significant word to be nonnegative. For example, assuming an 8-bit 
word, the two’s complement number 

11011111 1111110 1101011 

(where the sign bit is shown only in the most significant word) is properly thought 
of as 

$$-2^{21} + (1011111)_2 \cdot 2^{14} + (1111110)_2 \cdot 2^{7} + (1101011)_2.$$
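
The interpretation above can be written out as a short Python sketch. The function name `multiword_value` and the convention of passing the words as a list (most significant first) are assumptions for this example; only the first word carries a sign bit, and the radix is half the word size.

```python
def multiword_value(words, bits=8):
    """Value of a two's complement number whose sign bit appears only in words[0]."""
    radix = 1 << (bits - 1)                    # b = one half of the word size
    first = words[0]
    sign = first >> (bits - 1)                 # sign bit of the most significant word
    value = (first & (radix - 1)) - sign * radix
    for w in words[1:]:
        assert 0 <= w < radix                  # remaining words are nonnegative
        value = value * radix + w
    return value


example_value = multiword_value([0b11011111, 0b1111110, 0b1101011])
```

The result agrees with the expansion shown: the sign bit contributes $-2^{21}$ and the three 7-bit fields contribute 95, 126, and 107 with weights $2^{14}$, $2^7$, and 1.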

Addition of signed numbers is slightly easier when complement notations 
are being used, since the routine for adding n-place nonnegative integers can be 
used for arbitrary n-place integers; the sign appears only in the first word, so 
the less significant words may be added together irrespective of the actual sign. 
(Special attention must be given to the leftmost carry when ones’ complement 
notation is being used, however; it must be added into the least significant 
word, and possibly propagated further to the left.) Similarly, we find that 
subtraction of signed numbers is slightly simpler with complement notation. 
On the other hand, multiplication and division seem to be done most easily 
by working with nonnegative quantities and doing suitable complementation 
operations beforehand to make sure that both operands are nonnegative; it may 
be possible to avoid this complementation by devising some tricks for working 
directly with negative numbers in a complement notation, and it is not hard to 
see how this could be done in double-precision multiplication, but care should be 
taken not to slow down the inner loops of the subroutines when high precision 
is required. Note that the product of two m-place numbers in two’s complement
notation may require 2m + 1 places: the square of $(-b^m)$ is $b^{2m}$.

Let us now turn to an analysis of the quantity K that arises in Program A,
i.e., the number of carries that occur when two n-place numbers are being added
together. Although K has no effect on the total running time of Program A,
it does affect the running time of Program A’s counterparts that deal with
complement notations, and its analysis is interesting in itself as a significant
application of generating functions.

Suppose that u and v are independent random n-place integers, uniformly
distributed in the range $0 \le u, v < b^n$. Let $p_{nk}$ be the probability that exactly
k carries occur in the addition of u to v, and that one of these carries occurs
in the most significant position (so that $u + v \ge b^n$). Similarly, let $q_{nk}$ be the
probability that exactly k carries occur, but that there is no carry in the most
significant position. Then it is not hard to see that

$$\begin{aligned}
p_{0k} &= 0, \qquad q_{0k} = \delta_{0k}, \qquad \text{for all } k; \\
p_{(n+1)(k+1)} &= \frac{b+1}{2b}\,p_{nk} + \frac{b-1}{2b}\,q_{nk}, \\
q_{(n+1)k} &= \frac{b-1}{2b}\,p_{nk} + \frac{b+1}{2b}\,q_{nk};
\end{aligned} \tag{3}$$

this happens because $(b-1)/2b$ is the probability that $u_1 + v_1 \ge b$ and $(b+1)/2b$
is the probability that $u_1 + v_1 + 1 \ge b$, when $u_1$ and $v_1$ are independently and
uniformly distributed integers in the range $0 \le u_1, v_1 < b$.

To obtain further information about these quantities $p_{nk}$ and $q_{nk}$, we may
set up the generating functions

$$P(z,t) = \sum_{k,n} p_{nk} z^k t^n, \qquad Q(z,t) = \sum_{k,n} q_{nk} z^k t^n. \tag{4}$$

From (3) we have the basic relations

$$\begin{aligned}
P(z,t) &= zt\left(\frac{b+1}{2b}P(z,t) + \frac{b-1}{2b}Q(z,t)\right), \\
Q(z,t) &= 1 + t\left(\frac{b-1}{2b}P(z,t) + \frac{b+1}{2b}Q(z,t)\right).
\end{aligned}$$

These two equations are readily solved for P(z, t) and Q(z, t); and if we let

$$G(z,t) = P(z,t) + Q(z,t) = \sum_n G_n(z)\,t^n,$$

where $G_n(z)$ is the generating function for the total number of carries when
n-place numbers are added, we find that

$$G(z,t) = (b - zt)/p(z,t), \qquad \text{where } p(z,t) = b - \tfrac12(1+b)(1+z)t + zt^2. \tag{5}$$

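Note that $G(1,t) = 1/(1-t)$, and this checks with the fact that $G_n(1)$ must
equal 1 (it is the sum of all the possible probabilities). Taking partial derivatives
of (5) with respect to z, we find that

$$\begin{aligned}
\frac{\partial G}{\partial z} &= \sum_n G_n'(z)\,t^n = \frac{-t}{p(z,t)} + \frac{t(b - zt)(b + 1 - 2t)}{2\,p(z,t)^2}; \\
\frac{\partial^2 G}{\partial z^2} &= \sum_n G_n''(z)\,t^n = \frac{-t^2(b + 1 - 2t)}{p(z,t)^2} + \frac{t^2(b - zt)(b + 1 - 2t)^2}{2\,p(z,t)^3}.
\end{aligned}$$

Now let us put z = 1 and expand in partial fractions:

$$\begin{aligned}
\sum_n G_n'(1)\,t^n &= \frac{t}{2}\left(\frac{1}{(1-t)^2} - \frac{1}{(b-1)(1-t)} + \frac{1}{(b-1)(b-t)}\right), \\
\sum_n G_n''(1)\,t^n &= \frac{t^2}{2}\left(\frac{1}{(1-t)^3} - \frac{1}{(b-1)^2(1-t)} + \frac{1}{(b-1)^2(b-t)} + \frac{1}{(b-1)(b-t)^2}\right).
\end{aligned}$$

It follows that the average number of carries, i.e., the mean value of K, is

$$G_n'(1) = \frac12\left(n - \frac{1}{b-1}\left(1 - \left(\frac1b\right)^{\!n}\right)\right); \tag{6}$$

the variance is

$$G_n''(1) + G_n'(1) - G_n'(1)^2 = \frac14\left(n + \frac{2n}{b-1} - \frac{2b+1}{(b-1)^2} + \frac{2b+2}{(b-1)^2}\left(\frac1b\right)^{\!n} - \frac{1}{(b-1)^2}\left(\frac1b\right)^{\!2n}\right). \tag{7}$$

So the number of carries is just slightly less than $\frac12 n$ under these assumptions.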
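
The mean and variance can be compared with an exhaustive count for small b and n. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, the closed forms are those obtained from the partial-fraction expansions above, and exact rational arithmetic from the standard `fractions` module is used so that the comparison involves no rounding.

```python
from fractions import Fraction
from itertools import product


def count_carries(u, v, b, n):
    """Number of carries when the n-place radix-b integers u and v are added."""
    carries = carry = 0
    for _ in range(n):
        carry = 1 if (u % b) + (v % b) + carry >= b else 0
        carries += carry
        u //= b
        v //= b
    return carries


def exact_mean_and_variance(b, n):
    """Mean and variance of the carry count over all pairs 0 <= u, v < b^n."""
    total = total_sq = 0
    for u, v in product(range(b**n), repeat=2):
        k = count_carries(u, v, b, n)
        total += k
        total_sq += k * k
    pairs = Fraction(b ** (2 * n))
    mean = total / pairs
    return mean, total_sq / pairs - mean**2


def formula_mean(b, n):
    return Fraction(1, 2) * (n - Fraction(1, b - 1) * (1 - Fraction(1, b) ** n))


def formula_variance(b, n):
    c = Fraction(1, (b - 1) ** 2)
    x = Fraction(1, b) ** n
    return Fraction(1, 4) * (n + Fraction(2 * n, b - 1) - (2 * b + 1) * c
                             + (2 * b + 2) * c * x - c * x * x)


mean_2_2, variance_2_2 = exact_mean_and_variance(2, 2)
```

For b = 2 and n = 2, for instance, the sixteen pairs give 0, 1, or 2 carries with probabilities 9/16, 4/16, and 3/16, so the mean is 5/8 and the variance is 39/64.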

History and bibliography. The early history of the classical algorithms described 
in this section is left as an interesting project for the reader, and only the history 
of their implementation on computers will be traced here. 

The use of $10^n$ as an assumed radix when multiplying large numbers on a
desk calculator was discussed by D. N. Lehmer and J. P. Ballantine, AMM 30
(1923), 67-69.

Double-precision arithmetic on digital computers was first treated by J. von 
Neumann and H. H. Goldstine in their introductory notes on programming, 
originally published in 1947 [J. von Neumann, Collected Works 5, 142-151]. 
Theorems A and B above are due to D. A. Pope and M. L. Stein [CACM 3 
(1960), 652-654], whose paper also contains a bibliography of earlier work on 
double-precision routines. Other ways of choosing the trial quotient $\hat q$ have been
discussed by A. G. Cox and H. A. Luther, CACM 4 (1961), 353 [divide by $v_1 + 1$
instead of $v_1$], and by M. L. Stein, CACM 7 (1964), 472-474 [divide by $v_1$ or
$v_1 + 1$ according to the magnitude of $v_2$]. E. V. Krishnamurthy [CACM 8 (1965),
179-181] showed that examination of the single-precision remainder in the latter
method leads to an improvement over Theorem B. Krishnamurthy and Nandi
[CACM 10 (1967), 809-813] suggested a way to replace the normalization and
unnormalization operations of Algorithm D by a calculation of $\hat q$ based on several
leading digits of the operands. G. E. Collins and D. R. Musser have carried out
an interesting analysis of the original Pope and Stein algorithm [Inf. Proc. Letters
6 (1977), 151-155].

Several alternative methods for division have also been suggested: 

1) “Fourier division” [J. Fourier, Analyse des équations déterminées (Paris,
1831), §2.21]. This method, which was often used on desk calculators, essentially
obtains each new quotient digit by increasing the precision of the divisor and the 
dividend at each step. Some rather extensive tests by the author have indicated 
that such a method is inferior to the divide-and-correct technique above, but 
there may be some applications in which Fourier division is practical. See D. H. 
Lehmer, AMM 33 (1926), 198-206; J. V. Uspensky, Theory of Equations (New 
York: McGraw-Hill, 1948), 159-164. 

2) “Newton’s method” for evaluating the reciprocal of a number was extensively
used in early computers when there was no single-precision division instruction.
The idea is to find some initial approximation $x_0$ to the number $1/v$, then to let
$x_{n+1} = 2x_n - v x_n^2$. This method converges rapidly to $1/v$, since $x_n = (1 - \epsilon)/v$
implies that $x_{n+1} = (1 - \epsilon^2)/v$. Convergence to third order, i.e., with $\epsilon$ replaced
by $O(\epsilon^3)$ at each step, can be obtained using the formula

$$\begin{aligned}
x_{n+1} &= x_n + x_n(1 - v x_n) + x_n(1 - v x_n)^2 \\
&= x_n\bigl(1 + (1 - v x_n)(1 + (1 - v x_n))\bigr),
\end{aligned}$$

and similar formulas hold for fourth-order convergence, etc.; see P. Rabinowitz,
CACM 4 (1961), 98. For calculations on extremely large numbers, Newton’s
second-order method and subsequent multiplication by u can actually be considerably
faster than Algorithm D, if we increase the precision of $x_n$ at each step
and if we also use the fast multiplication routines of Section 4.3.3. (See Algorithm
4.3.3D for details.) Some related iterative schemes have been discussed by E. V.
Krishnamurthy, IEEE Trans. C-19 (1970), 227-231.

3) Division methods have also been based on the evaluation of 

See H. H. Laughlin, AMM 37 (1930), 287-293. We have used this idea in the 
double-precision case (Eq. 4.2.3-3). 
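
The convergence claims made for Newton’s method in item 2 can be observed exactly with rational arithmetic. The following Python sketch is an illustration only; the function names `second_order_step`, `third_order_step`, and `relative_errors` are hypothetical, and the standard `fractions` module stands in for the multiple-precision arithmetic a real implementation would use. The recorded quantity is $\epsilon = 1 - v x_n$, so that $x_n = (1 - \epsilon)/v$.

```python
from fractions import Fraction


def second_order_step(x, v):
    """One Newton step for the reciprocal: x <- 2x - v*x^2."""
    return 2 * x - v * x * x


def third_order_step(x, v):
    """One step of the third-order formula: x <- x(1 + e(1 + e)) with e = 1 - v*x."""
    e = 1 - v * x
    return x * (1 + e * (1 + e))


def relative_errors(step, v, x0, steps):
    """Values of epsilon = 1 - v*x before each step and after the last one."""
    x = Fraction(x0)
    errors = [1 - v * x]
    for _ in range(steps):
        x = step(x, v)
        errors.append(1 - v * x)
    return errors


newton_errors = relative_errors(second_order_step, 7, Fraction(1, 10), 4)
cubic_errors = relative_errors(third_order_step, 7, Fraction(1, 10), 3)
```

Starting from $x_0 = 1/10$ as an approximation to 1/7, the error begins at 3/10 and is squared by each second-order step and cubed by each third-order step.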

Besides the references just cited, the following early articles concerning 
multiple-precision arithmetic are of interest: High-precision routines for floating 
point calculations using ones’ complement arithmetic are described by A. H.
Stroud and D. Secrest, Comp. J. 6 (1963), 62-66. Extended-precision subroutines
for use in FORTRAN programs are described by B. I. Blum, CACM 8 (1965), 
318-320; and for use in ALGOL by M. Tienari and V. Suokonautio, BIT 6 (1966), 
332-338. Arithmetic on integers with unlimited precision, making use of linked 
memory allocation techniques, has been elegantly described by G. E. Collins, 
CACM 9 (1966), 578-589. For a much larger repertoire of operations, including 
logarithms and trigonometric functions, see R. P. Brent, ACM Trans. Math. 
Software 4 (1978), 57-81. 

We have restricted our discussion in this section to arithmetic techniques 
for use in computer programming. There are many algorithms for hardware 
implementation of arithmetic operations that are very interesting, but they 
appear to be inapplicable to computer programs for high-precision numbers; 
for example, see G. W. Reitwiesner, “Binary Arithmetic,” Advances in Computers
1 (New York: Academic Press, 1960), 231-308; O. L. MacSorley, Proc.
IRE 49 (1961), 67-91; G. Metze, IRE Trans. EC-11 (1962), 761-764; H. L.
Garner, “Number Systems and Arithmetic,” Advances in Computers 6 (New 
York: Academic Press, 1965), 131-194. The minimum achievable execution time 
for hardware addition and multiplication operations has been investigated by S. 
Winograd, JACM 12 (1965), 277-285, 14 (1967), 793-802; by R. P. Brent, IEEE 
Trans. C-19 (1970), 758-759; and by R. W. Floyd, IEEE Symp. Found. Comp. 
Sci. 16 (1975), 3-5. 


EXERCISES 

1. [42] Study the early history of the classical algorithms for arithmetic by looking
up the writings of, say, Sun Tsu, al-Khwarizmi, al-Uqlidisi, Fibonacci, and Robert
Recorde, and by translating their methods as faithfully as possible into precise
algorithmic notation.

2. [15] Generalize Algorithm A so that it does “column addition,” i.e., obtains the
sum of m nonnegative n-place integers. (Assume that $m \le b$.)

3. [21] Write a MIX program for the algorithm of exercise 2, and estimate its running 
time as a function of m and n. 

4. [M21] Give a formal proof of the validity of Algorithm A, using the method of
“inductive assertions” explained in Section 1.2.1. 

5. [21] Algorithm A adds the two inputs by going from right to left, but sometimes 
the data is more readily accessible from left to right. Design an algorithm that produces 
the same answer as Algorithm A, but that generates the digits of the answer from left 
to right, going back to change previous values if a carry occurs to make a previous 
value incorrect. [Note: Early Hindu and Arabic manuscripts dealt with addition from 
left to right in this way, probably because it was customary to work from left to right 
on an abacus; the right-to-left addition algorithm was a refinement due to al-Uqlidisi, 
perhaps because Arabic is written from right to left.]

► 6. [22] Design an algorithm that adds from left to right (as in exercise 5), but your
algorithm should not store a digit of the answer until this digit cannot possibly be
affected by future carries; there is to be no changing of any answer digit once it has
been stored. [Hint: Keep track of the number of consecutive $(b - 1)$’s that have not yet
been stored in the answer.] This sort of algorithm would be appropriate, for example,
in a situation where the input and output numbers are to be read and written from
left to right on magnetic tapes, or if they appear in straight linear lists.

7. [M26] Determine the average number of times the algorithm of exercise 5 will find
that a carry makes it necessary to go back and change k digits of the partial answer,
for k = 1, 2, ..., n. (Assume that both inputs are independently and uniformly
distributed between 0 and $b^n - 1$.)

8. [M26] Write a MIX program for the algorithm of exercise 5, and determine its 
average running time based on the expected number of carries as computed in the text. 

► 9. [21] Generalize Algorithm A to obtain an algorithm that adds two n-place numbers
in a mixed-radix number system, with bases $b_0, b_1, \ldots$ (from right to left). Thus
the least significant digits lie between 0 and $b_0 - 1$, the next digits lie between 0 and
$b_1 - 1$, etc.; cf. Eq. 4.1-9.

10. [18] Would Program S work properly if the instructions on lines 06 and 07 were 
interchanged? If the instructions on lines 05 and 06 were interchanged? 

11. [10] Design an algorithm that compares two nonnegative n-place integers
$u = (u_1 u_2 \ldots u_n)_b$ and $v = (v_1 v_2 \ldots v_n)_b$, to determine whether u < v, u = v, or u > v.

12. [16] Algorithm S assumes that we know which of the two input operands is the 
larger; if this information is not known, we could go ahead and perform the subtraction 
anyway, and we would find that an extra “borrow” is still present at the end of the 
algorithm. Design another algorithm that could be used (if there is a “borrow” present 
at the end of Algorithm S) to complement $(w_1 w_2 \ldots w_n)_b$ and therefore to obtain the
absolute value of the difference of u and v. 

13. [21] Write a MIX program that multiplies $(u_1 u_2 \ldots u_n)_b$ by v, where v is a
single-precision number (i.e., $0 \le v < b$), producing the answer $(w_0 w_1 \ldots w_n)_b$. How much
running time is required?

► 14. [M24] Give a formal proof of the validity of Algorithm M, using the method of 
“inductive assertions” explained in Section 1.2.1. 

15. [M20] If we wish to form the product of two n-place fractions, $(.u_1 u_2 \ldots u_n)_b \times
(.v_1 v_2 \ldots v_n)_b$, and to obtain only an n-place approximation $(.w_1 w_2 \ldots w_n)_b$ to the
result, Algorithm M could be used to obtain a 2n-place answer that is subsequently
rounded to the desired approximation. But this involves about twice as much work
as is necessary for reasonable accuracy, since the products $u_i v_j$ for $i + j > n + 2$
contribute very little to the answer.

Give an estimate of the maximum error that can occur, if these products $u_i v_j$ for
$i + j > n + 2$ are not computed during the multiplication, but are assumed to be zero.

► 16. [20] Design an algorithm that divides a nonnegative n-place integer $(u_1 u_2 \ldots u_n)_b$
by v, where v is a single-precision number (i.e., 0 < v < b), producing the quotient
$(w_1 w_2 \ldots w_n)_b$ and remainder r.

17. [M20] In the notation of Fig. 6, assume that $v_1 \ge \lfloor b/2 \rfloor$; show that if $u_0 = v_1$,
we must have $q = b - 1$ or $b - 2$.

18. [M20] In the notation of Fig. 6, show that if $q' = \lfloor (u_0 b + u_1)/(v_1 + 1) \rfloor$, then
$q' \le q$.

► 19. [M21] In the notation of Fig. 6, let $\hat q$ be an approximation to q, and let
$\hat r = u_0 b + u_1 - \hat q v_1$. Assume that $v_1 > 0$. Show that if $v_2 \hat q > b\hat r + u_2$, then $q < \hat q$. [Hint:
Strengthen the proof of Theorem A by examining the influence of $v_2$.]

20. [M22] Using the notation and assumptions of exercise 19, show that if
$v_2 \hat q \le b\hat r + u_2$, then $\hat q = q$ or $q = \hat q - 1$.

► 21. [M23] Show that if $v_1 \ge \lfloor b/2 \rfloor$, and if $v_2 \hat q \le b\hat r + u_2$ but $\hat q \ne q$ in the notation
of exercises 19 and 20, then $u \bmod v \ge (1 - 2/b)v$. (The latter event occurs with
approximate probability 2/b, so that when b is the word size of a computer we must
have $q_j = \hat q$ in Algorithm D except in very rare circumstances.)

► 22. [24] Find an example of a four-digit number divided by a three-digit number for
which step D6 is necessary in Algorithm D, when the radix b is 10.

23. [M23] Given that v and b are integers, and that $1 \le v < b$, prove that we always
have $\lfloor b/2 \rfloor \le v\lfloor b/(v + 1) \rfloor < (v + 1)\lfloor b/(v + 1) \rfloor \le b$.

24. [M20] Using the law of the distribution of leading digits explained in Section
4.2.4, give an approximate formula for the probability that d = 1 in Algorithm D.
(When d = 1, it is, of course, possible to omit most of the calculation in steps D1
and D8.)

25. [26] Write a MIX routine for step Dl, which is needed to complete Program D. 

26. [21] Write a MIX routine for step D 8 , which is needed to complete Program D. 

27. [M20] Prove that at the beginning of step D 8 in Algorithm D, the unnormalized 
remainder (u m -f iu m -|_2 • ■ ■ u m -\-n)b is always an exact multiple of d. 

28. [M30] (A. Svoboda, Stroje na Zpracování Informací 9 (1963), 25-32.) Let v =
(v_1 v_2 ... v_n)_b be any radix b integer, where v_1 ≠ 0. Perform the following operations:

N1. If v_1 < b/2, multiply v by ⌊(b + 1)/(v_1 + 1)⌋. Let the result of this step be
(v_0 v_1 v_2 ... v_n)_b.

N2. If v_0 = 0, set v ← v + (1/b)⌊b(b - v_1)/(v_1 + 1)⌋v; let the result of this step be
(v_0 v_1 v_2 ... v_n . v_{n+1} ...)_b. Repeat step N2 until v_0 ≠ 0.

Prove that step N2 will be performed at most three times, and that we must always
have v_0 = 1, v_1 = 0 at the end of the calculations.

[Note: If u and v are both multiplied by the above constants, we do not change
the value of the quotient u/v, and the divisor has been converted into the form
(1 0 v_2 ... v_n . v_{n+1} v_{n+2} v_{n+3})_b. This form of the divisor is very convenient because, in
the notation of Algorithm D, we may simply take q̂ = u_j as a trial divisor at the
beginning of step D3, or q̂ = b - 1 when (u_{j-1}, u_j) = (1, 0).]

29. [15] Prove or disprove: At the beginning of step D7 of Algorithm D, we always
have u_j = 0.

► 30. [22] If memory space is limited, it may be desirable to use the same storage
locations for both input and output during the performance of some of the algorithms
in this section. Is it possible to have w_1, ..., w_n stored in the same respective locations
as u_1, ..., u_n or v_1, ..., v_n during Algorithm A or S? Is it possible to have the quotient
q_0, ..., q_m occupy the same locations as u_0, ..., u_m in Algorithm D? Is there any
permissible overlap of memory locations between input and output in Algorithm M?

31. [28] Assume that b = 3 and that u = (u_1 ... u_{m+n})_3, v = (v_1 ... v_n)_3 are integers
in balanced ternary notation (cf. Section 4.1), v_1 ≠ 0. Design a long-division algorithm
that divides u by v, obtaining a remainder whose absolute value does not exceed
(1/2)|v|. Try to find an algorithm that would be efficient if incorporated into the arithmetic
circuitry of a balanced ternary computer.

32. [M40] Assume that b = 2i and that u and v are complex numbers expressed in
the quater-imaginary number system. Design algorithms that divide u by v, perhaps
obtaining a suitable remainder of some sort, and compare their efficiency. [References:
M. Nadler, CACM 4 (1961), 192-193; Z. Pawlak and A. Wakulicz, Bull. de l’Acad.
Polonaise des Sciences, Classe III, 5 (1957), 233-236 (see also pp. 803-804); and exercise
4.1-15.]

33. [M40] Design an algorithm for taking square roots, analogous to Algorithm D and
to the traditional pencil-and-paper method for extracting square roots.

34. [40] Develop a set of computer subroutines for doing the four arithmetic operations
on arbitrary integers, putting no constraint on the size of the integers except for
the implicit assumption that the total memory capacity of the computer should not be
exceeded. (Use linked memory allocation, so that no time is wasted in finding room
to put the results.)

35. [40] Develop a set of computer subroutines for “decuple-precision floating point”
arithmetic, using excess 0, base b, nine-place floating point number representation,
where b is the computer word size, and allowing a full word for the exponent. (Thus
each floating point number is represented in 10 words of memory, and all scaling is
done by moving full words instead of by shifting within the words.)

36. [M42] Compute the values of the fundamental constants listed in Appendix B to
much higher precision than the 40-place values listed there. [Note: The first 100,000
digits of the decimal expansion of π were published by D. Shanks and J. W. Wrench,
Jr., in Math. Comp. 16 (1962), 76-99. One million digits of π were computed by Jean
Guilloud and Martine Bouyer of the French Atomic Energy Commission in 1974.]

► 37. [20] (E. Salamin.) Explain how to avoid the normalization and unnormalization
steps of Algorithm D, when d is a power of 2 on a binary computer, without changing
the sequence of trial quotient digits computed by that algorithm. (How can q̂ be
computed in step D3 if the normalization of step D1 hasn’t been done?)


*4.3.2. Modular Arithmetic 

Another interesting alternative is available for doing arithmetic on large integer
numbers, based on some simple principles of number theory. The idea is to have
several “moduli” m_1, m_2, ..., m_r that contain no common factors, and to work
indirectly with “residues” u mod m_1, u mod m_2, ..., u mod m_r instead of directly
with the number u.

For convenience in notation throughout this section, let

    u_1 = u mod m_1,  u_2 = u mod m_2,  ...,  u_r = u mod m_r.    (1)

It is easy to compute (u_1, u_2, ..., u_r) from an integer number u by means of
division; and it is important to note that no information is lost in this process,
since we can recompute u from (u_1, u_2, ..., u_r) provided that u is not too large.
For example, if 0 ≤ u < v < 1000, it is impossible to have (u mod 7, u mod 11,
u mod 13) equal to (v mod 7, v mod 11, v mod 13). This is a consequence of the
“Chinese remainder theorem” stated below.

We may therefore regard (u_1, u_2, ..., u_r) as a new type of internal computer
representation, a “modular representation,” of the integer u.

The advantages of a modular representation are that addition, subtraction, 
and multiplication are very simple: 

    (u_1, ..., u_r) + (v_1, ..., v_r) = ((u_1 + v_1) mod m_1, ..., (u_r + v_r) mod m_r),    (2)

    (u_1, ..., u_r) - (v_1, ..., v_r) = ((u_1 - v_1) mod m_1, ..., (u_r - v_r) mod m_r),    (3)

    (u_1, ..., u_r) × (v_1, ..., v_r) = ((u_1 × v_1) mod m_1, ..., (u_r × v_r) mod m_r).    (4)

To derive (4), for example, we need to show that

    uv mod m_j = (u mod m_j)(v mod m_j) mod m_j

for each modulus m_j. But this is a basic fact of elementary number theory:
x mod m_j = y mod m_j if and only if x ≡ y (modulo m_j); furthermore if x ≡ x'
and y ≡ y', then xy ≡ x'y' (modulo m_j); hence (u mod m_j)(v mod m_j) ≡ uv
(modulo m_j).
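
As a small illustration of (2), (3), and (4), the following Python sketch works with the moduli 7, 11, 13 of the example above. The names MODULI, to_residues, add, sub, and mul are chosen here for illustration only; they do not come from the text.

```python
MODULI = (7, 11, 13)   # pairwise relatively prime; their product is 1001


def to_residues(u, moduli=MODULI):
    """Return the modular representation (u mod m_1, ..., u mod m_r)."""
    return tuple(u % m for m in moduli)


def add(us, vs, moduli=MODULI):
    """Componentwise addition, as in (2)."""
    return tuple((u + v) % m for u, v, m in zip(us, vs, moduli))


def sub(us, vs, moduli=MODULI):
    """Componentwise subtraction, as in (3)."""
    return tuple((u - v) % m for u, v, m in zip(us, vs, moduli))


def mul(us, vs, moduli=MODULI):
    """Componentwise multiplication, as in (4)."""
    return tuple((u * v) % m for u, v, m in zip(us, vs, moduli))


u, v = 23, 41                      # u * v = 943 is still below 1001
print(to_residues(u))              # (2, 1, 10)
print(mul(to_residues(u), to_residues(v)) == to_residues(u * v))   # True
```

No component ever looks at another component, which is the carry-free property exploited later in this section.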

The disadvantages of a modular representation are that it is comparatively 
difficult to test whether a number is positive or negative or to test whether or not 
(u_1, ..., u_r) is greater than (v_1, ..., v_r). It is also difficult to test whether or not
overflow has occurred as the result of an addition, subtraction, or multiplication, 
and it is even more difficult to perform division. When these operations are 
required frequently in conjunction with addition, subtraction, and multiplication, 
the use of modular arithmetic can be justified only if fast means of conversion 
into and out of the modular representation are available. Therefore conversion 
between modular and positional notation is one of the principal topics of interest 
to us in this section. 

The processes of addition, subtraction, and multiplication using (2), (3), and
(4) are called residue arithmetic or modular arithmetic. The range of numbers
that can be handled by modular arithmetic is equal to m = m_1 m_2 ... m_r, the
product of the moduli. Therefore we see that the amount of time required to add,
subtract, or multiply n-digit numbers using modular arithmetic is essentially
proportional to n (not counting the time to convert in and out of modular
representation). This is no advantage at all when addition and subtraction are
considered, but it can be a considerable advantage with respect to multiplication
since the conventional method of the preceding section requires an execution
time proportional to n^2.

Moreover, on a computer that allows many operations to take place simultaneously,
modular arithmetic can be a significant advantage even for addition
and subtraction; the operations with respect to different moduli can all be done
at the same time, so we obtain a substantial increase in speed. The same kind of
decrease in execution time could not be achieved by the conventional techniques
discussed in the previous section, since carry propagation must be considered. 
Perhaps some day highly parallel computers will make simultaneous operations 
commonplace, so that modular arithmetic will be of significant importance in 
“real-time” calculations when a quick answer to a single problem requiring high 
precision is needed. (With highly parallel computers, it is often preferable to 
run k separate programs simultaneously, instead of running a single program k 
times as fast, since the latter alternative is more complicated but does not utilize 
the machine any more efficiently. “Real-time” calculations are exceptions that 
make the inherent parallelism of modular arithmetic more significant.) 

Now let us examine the basic fact that underlies the modular representation 
of numbers: 

Theorem C (Chinese Remainder Theorem). Let m_1, m_2, ..., m_r be positive
integers that are relatively prime in pairs, i.e.,

    gcd(m_j, m_k) = 1 when j ≠ k.    (5)

Let m = m_1 m_2 ... m_r, and let a, u_1, u_2, ..., u_r be integers. Then there is
exactly one integer u that satisfies the conditions

    a ≤ u < a + m, and u ≡ u_j (modulo m_j) for 1 ≤ j ≤ r.    (6)

Proof. If u ≡ v (modulo m_j) for 1 ≤ j ≤ r, then u - v is a multiple of m_j for
all j, so (5) implies that u - v is a multiple of m = m_1 m_2 ... m_r. This argument
shows that there is at most one solution of (6). To complete the proof we must
now show the existence of at least one solution, and this can be done in two
simple ways:

METHOD 1 (“Nonconstructive” proof). As u runs through the m distinct values
a ≤ u < a + m, the r-tuples (u mod m_1, ..., u mod m_r) must also run through
m distinct values, since (6) has at most one solution. But there are exactly
m_1 m_2 ... m_r possible r-tuples (v_1, ..., v_r) such that 0 ≤ v_j < m_j. Therefore
each r-tuple must occur exactly once, and there must be some value of u for
which (u mod m_1, ..., u mod m_r) = (u_1, ..., u_r).

METHOD 2 (“Constructive” proof). We can find numbers M_j for 1 ≤ j ≤ r
such that

    M_j ≡ 1 (modulo m_j) and M_j ≡ 0 (modulo m_k) for k ≠ j.    (7)

This follows because (5) implies that m_j and m/m_j are relatively prime, so we
may take

    M_j = (m/m_j)^{φ(m_j)}    (8)

by Euler’s theorem (exercise 1.2.4-28). Now the number

    u = a + ((u_1 M_1 + u_2 M_2 + ... + u_r M_r - a) mod m)    (9)

satisfies all the conditions of (6). |
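
The constructive proof can be tried out directly on small moduli. The sketch below is only an illustration under stated assumptions: it takes M_j = (m/m_j)^φ(m_j) mod m, the choice that Euler’s theorem permits, and it computes φ by brute force, which is reasonable only for tiny moduli. The function names phi, crt_constants, and crt are hypothetical.

```python
from math import gcd, prod


def phi(n):
    """Euler's totient function by brute force; adequate only for small n."""
    return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)


def crt_constants(moduli):
    """Constants M_j with M_j = 1 (mod m_j) and M_j = 0 (mod m_k) for k != j."""
    m = prod(moduli)
    return [pow(m // mj, phi(mj), m) for mj in moduli]


def crt(residues, moduli, a=0):
    """The unique u with a <= u < a + m and u = u_j (mod m_j), via formula (9)."""
    m = prod(moduli)
    total = sum(uj * Mj for uj, Mj in zip(residues, crt_constants(moduli)))
    return a + (total - a) % m


print(crt((2, 1, 10), (7, 11, 13)))           # 23
print(crt((2, 1, 10), (7, 11, 13), a=-500))   # 23, now taken from -500 <= u < 501
```

Even in this toy setting the constants M_j are as large as m itself, which anticipates the objection raised later against using (9) in practice.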

A very special case of this theorem was stated by the Chinese mathematician
Sun Tsu, who gave a rule called tai-yen (“great generalization”). The date of 
his writing is very uncertain; it is thought to be between 280 and 473 A.D. [See 
Joseph Needham, Science and Civilization in China 3 (Cambridge University 
Press, 1959), 33-34, 119-120, for an interesting discussion.] Theorem C was 
apparently first stated and proved in its proper generality by Chhin Chiu Shao 
in his Shu Shu Chiu Chang (1247). Numerous early contributions to this theory 
have been summarized by L. E. Dickson in his History of the Theory of Numbers 
2 (New York: Chelsea, 1952), 57-64. 

As a consequence of Theorem C, we may use modular representation for
numbers in any consecutive interval of m = m_1 m_2 ... m_r integers. For example,
we could take a = 0 in (6), and work only with nonnegative integers u less
than m. On the other hand, when addition and subtraction are being done, as
well as multiplication, it is usually most convenient to assume that all the moduli
m_1, m_2, ..., m_r are odd numbers, so that m = m_1 m_2 ... m_r is odd, and to work
with integers in the range

    -m/2 < u < m/2,    (10)

which is completely symmetrical about zero.

To perform the basic operations listed in (2), (3), and (4), we need to compute
(u_j + v_j) mod m_j, (u_j - v_j) mod m_j, and u_j v_j mod m_j, when 0 ≤ u_j, v_j < m_j.
If m_j is a single-precision number, it is most convenient to form u_j v_j mod m_j
by doing a multiplication and then a division operation. For addition and
subtraction, the situation is a little simpler, since no division is necessary; the
following formulas may conveniently be used:

    (u_j + v_j) mod m_j = u_j + v_j,        if u_j + v_j < m_j;
                          u_j + v_j - m_j,  if u_j + v_j ≥ m_j.    (11)

    (u_j - v_j) mod m_j = u_j - v_j,        if u_j - v_j ≥ 0;
                          u_j - v_j + m_j,  if u_j - v_j < 0.    (12)

(Cf. Section 3.2.1.1.) In this case, since we want m to be as large as possible, it
is easiest to let m_1 be the largest odd number that fits in a computer word, to let
m_2 be the largest odd number < m_1 that is relatively prime to m_1, to let m_3 be
the largest odd number < m_2 that is relatively prime to both m_1 and m_2, and
so on until enough m_j’s have been found to give the desired range m. Efficient
ways to determine whether or not two integers are relatively prime are discussed
in Section 4.5.2.

As a simple example, suppose that we have a decimal computer whose words 
hold only two digits, so that the word size is 100. Then the procedure described 
in the previous paragraph would give 


    m_1 = 99,  m_2 = 97,  m_3 = 95,  m_4 = 91,  m_5 = 89,  m_6 = 83,    (13)


and so on. 
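
The selection procedure just described is easy to mechanize. The following sketch assumes a word size of 100 as in the example; the function name greedy_odd_moduli and its parameters are chosen here for illustration. It stops after a requested number of moduli rather than running the process to exhaustion.

```python
from math import gcd


def greedy_odd_moduli(word_size, count):
    """Pick `count` odd moduli below word_size, largest first, pairwise coprime."""
    moduli = []
    # Largest odd number that fits in a word holding values 0 .. word_size - 1.
    candidate = word_size - 1 if word_size % 2 == 0 else word_size - 2
    while candidate >= 3 and len(moduli) < count:
        if all(gcd(candidate, m) == 1 for m in moduli):
            moduli.append(candidate)
        candidate -= 2
    return moduli


print(greedy_odd_moduli(100, 6))   # [99, 97, 95, 91, 89, 83]
```

Here 93, 87, and 85 are skipped because they share the factors 3, 3, and 5 with moduli already chosen.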

On binary computers it is sometimes desirable to choose the m_j in a different
way, by selecting

    m_j = 2^{e_j} - 1.    (14)

In other words, each modulus is one less than a power of 2. Such a choice of
m_j often makes the basic arithmetic operations simpler, because it is relatively
easy to work modulo 2^{e_j} - 1, as in ones’ complement arithmetic. When the
moduli are chosen according to this strategy, it is helpful to relax the condition
0 ≤ u_j < m_j slightly, so that we require only

    0 ≤ u_j < 2^{e_j},  u_j ≡ u (modulo 2^{e_j} - 1).    (15)

Thus, the value u_j = m_j = 2^{e_j} - 1 is allowed as an optional alternative to
u_j = 0, since this does not affect the validity of Theorem C, and it means we
are allowing u_j to be any e_j-bit binary number. Under this assumption, the
operations of addition and multiplication modulo m_j become the following:

    u_j ⊕ v_j = u_j + v_j,                      if u_j + v_j < 2^{e_j};
                ((u_j + v_j) mod 2^{e_j}) + 1,  if u_j + v_j ≥ 2^{e_j}.    (16)

    u_j ⊗ v_j = (u_j v_j mod 2^{e_j}) ⊕ ⌊u_j v_j / 2^{e_j}⌋.    (17)

(Here ⊕ and ⊗ refer to the operations done on the individual components of
(u_1, ..., u_r) and (v_1, ..., v_r) when adding or multiplying, respectively, using the
convention (15).) Equation (12) may be used for subtraction. These operations
can be performed efficiently even when m_j is larger than the computer’s word
size, since it is a simple matter to compute the remainder of a positive number
modulo a power of 2, or to divide a number by a power of 2. In (17) we have
the sum of the “upper half” and the “lower half” of the product, as discussed in
exercise 3.2.1.1-8.
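
The two operations (16) and (17) need only masking and shifting. A minimal Python sketch follows, assuming e-bit operands under the relaxed convention (15); the names oplus and otimes are illustrative stand-ins for the circled operators.

```python
def oplus(u, v, e):
    """Addition modulo 2**e - 1 on e-bit operands, following (16)."""
    mask = (1 << e) - 1
    s = u + v
    # An overflow past e bits is folded back in as an end-around carry.
    return s if s <= mask else (s & mask) + 1


def otimes(u, v, e):
    """Multiplication modulo 2**e - 1, following (17)."""
    mask = (1 << e) - 1
    p = u * v
    # Lower half of the product "plus" its upper half.
    return oplus(p & mask, p >> e, e)


e = 5                                   # modulus 2**5 - 1 = 31
print(oplus(20, 25, e), (20 + 25) % 31)     # 14 14
print(otimes(20, 25, e), (20 * 25) % 31)    # 4 4
```

A result of 2**e - 1 may appear in place of 0; under (15) both stand for the same residue.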

If moduli of the form 2^{e_j} - 1 are to be used, we must know under what conditions
the number 2^e - 1 is relatively prime to the number 2^f - 1. Fortunately,
there is a very simple rule,

    gcd(2^e - 1, 2^f - 1) = 2^{gcd(e,f)} - 1,    (18)

which states in particular that 2^e - 1 and 2^f - 1 are relatively prime if and only
if e and f are relatively prime. Equation (18) follows from Euclid’s algorithm
and the identity

    (2^e - 1) mod (2^f - 1) = 2^{e mod f} - 1.    (19)

(See exercise 6.) Thus we could choose for example m_1 = 2^35 - 1, m_2 = 2^34 - 1,
m_3 = 2^33 - 1, m_4 = 2^31 - 1, m_5 = 2^29 - 1, if we had a computer with word
size 2^35; this would permit efficient addition, subtraction, and multiplication of
integers in a range of size m_1 m_2 m_3 m_4 m_5 > 2^161.
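
Rule (18) and the five moduli of this example can be checked numerically. The short script below is only a sanity check, not a proof; the variable names are chosen here for illustration.

```python
from itertools import combinations
from math import gcd, prod

exponents = (35, 34, 33, 31, 29)
moduli = [2**e - 1 for e in exponents]

# The exponents are pairwise relatively prime, so by (18) the moduli are too.
pairwise_coprime = all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))

# Spot-check identity (18) over a small range of exponents.
identity_holds = all(
    gcd(2**e - 1, 2**f - 1) == 2**gcd(e, f) - 1
    for e in range(1, 40)
    for f in range(1, 40)
)

print(pairwise_coprime, identity_holds, prod(moduli) > 2**161)   # True True True
```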

As we have already observed, the operations of conversion to and from
modular representation are very important. If we are given a number u, its
modular representation (u_1, ..., u_r) may be obtained by simply dividing u by
m_1, ..., m_r and saving the remainders. A possibly more attractive procedure,
if u = (v_m v_{m-1} ... v_0)_b, is to evaluate the polynomial

    (...(v_m b + v_{m-1})b + ...)b + v_0

using modular arithmetic. When b = 2 and when the modulus m_j has the
special form 2^{e_j} - 1, both of these methods reduce to quite a simple procedure:
Consider the binary representation of u with blocks of e_j bits grouped together,

    u = a_t A^t + a_{t-1} A^{t-1} + ... + a_1 A + a_0,    (20)

where A = 2^{e_j} and 0 ≤ a_k < 2^{e_j} for 0 ≤ k ≤ t. Then

    u ≡ a_t + a_{t-1} + ... + a_1 + a_0 (modulo 2^{e_j} - 1),    (21)

since A ≡ 1, so we obtain u_j by adding the e_j-bit numbers a_t ⊕ ... ⊕ a_1 ⊕ a_0,
using (16). This process is similar to the familiar device of “casting out nines”
that determines u mod 9 when u is expressed in the decimal system.
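
The block-summing procedure of (20) and (21) can be sketched as follows, assuming a nonnegative integer u and a modulus 2^e - 1. The function name residue_mod_mersenne is hypothetical, and the result follows the relaxed convention (15), so the value 2^e - 1 may be returned where the true remainder is 0.

```python
def residue_mod_mersenne(u, e):
    """Return an e-bit value congruent to u modulo 2**e - 1 by summing e-bit blocks."""
    mask = (1 << e) - 1
    acc = 0
    while u:
        acc += u & mask              # add the next e-bit block a_k
        if acc > mask:               # end-around carry, as in (16)
            acc = (acc & mask) + 1
        u >>= e
    return acc


print(residue_mod_mersenne(1000, 5), 1000 % 31)   # 8 8
print(residue_mod_mersenne(31, 5))                # 31, which stands for residue 0
```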

Conversion back from modular form to positional notation is somewhat 
more difficult. It is interesting in this regard to make a few side remarks about 
the way computers make us change our viewpoint towards mathematical proofs: 
Theorem C tells us that the conversion from (u_1, ..., u_r) to u is possible, and
two proofs are given. The first proof we considered is a classical one that relies
only on very simple concepts, namely the facts that

i) any number that is a multiple of m_1, of m_2, ..., and of m_r, must be a
multiple of m_1 m_2 ... m_r when the m_j’s are pairwise relatively prime; and

ii) if m things are put into m boxes with no two things in the same box, then
there must be one in each box.

By traditional notions of mathematical aesthetics, this is no doubt the nicest
proof of Theorem C; but from a computational standpoint it is completely
worthless. It amounts to saying, “Try u = a, a + 1, ... until you find a value
for which u ≡ u_1 (modulo m_1), ..., u ≡ u_r (modulo m_r).”

The second proof of Theorem C is more explicit; it shows how to compute r
new constants M_1, ..., M_r, and to get the solution in terms of these constants
by formula (9). This proof uses more complicated concepts (for example, Euler’s
theorem), but it is much more satisfactory from a computational standpoint,
since the constants M_1, ..., M_r need to be determined only once. On the
other hand, the determination of M_j by Eq. (8) is certainly not trivial, since the
evaluation of Euler’s φ-function requires, in general, the factorization of m_j into
prime powers. Furthermore, M_j is likely to be a terribly large number, even if we
compute only the quantity M_j mod m (which will work just as well as M_j in (9)).

Since M_j mod m is uniquely determined if (7) is to be satisfied (because of the
Chinese remainder theorem), we can see that, in any event, Eq. (9) requires a
lot of high-precision calculation, and such calculation is just what we wished to
avoid by modular arithmetic in the first place.

So we need an even better proof of Theorem C if we are going to have a
really usable method of conversion from (u_1, ..., u_r) to u. Such a method was
suggested by H. L. Garner in 1958; it can be carried out using r(r - 1)/2 constants c_{ij}
for 1 ≤ i < j ≤ r, where

    c_{ij} m_i ≡ 1 (modulo m_j).    (22)

These constants c_{ij} are readily computed using Euclid’s algorithm, since for any
given i and j Algorithm 4.5.2X will determine a and b such that a m_i + b m_j =
gcd(m_i, m_j) = 1, and we may take c_{ij} = a. When the moduli have the special
form 2^{e_j} - 1, a simple method of determining c_{ij} is given in exercise 6.

Once the c_{ij} have been determined satisfying (22), we can set

    v_1 ← u_1 mod m_1,
    v_2 ← (u_2 - v_1) c_{12} mod m_2,
    v_3 ← ((u_3 - v_1) c_{13} - v_2) c_{23} mod m_3,
    ...
    v_r ← (...((u_r - v_1) c_{1r} - v_2) c_{2r} - ... - v_{r-1}) c_{(r-1)r} mod m_r.    (23)

Then

    u = v_r m_{r-1} ... m_2 m_1 + ... + v_3 m_2 m_1 + v_2 m_1 + v_1    (24)

is a number satisfying the conditions

    0 ≤ u < m,  u ≡ u_j (modulo m_j) for 1 ≤ j ≤ r.    (25)

(See exercise 8; another way of rewriting (23) that does not involve as many
auxiliary constants is given in exercise 7.) Equation (24) is a mixed-radix representation
of u, which may be converted to binary or decimal notation using the
methods of Section 4.4. If 0 ≤ u < m is not the desired range, an appropriate
multiple of m can be added or subtracted after the conversion process.
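
The conversion by (23) and (24) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the text’s own program: the names garner_digits and from_mixed_radix are hypothetical, and the constants c_{ij} are recomputed on the fly with Python’s built-in pow(x, -1, m) (available from Python 3.8), whereas a real implementation would tabulate them once.

```python
def garner_digits(residues, moduli):
    """Mixed-radix digits v_1, ..., v_r of (23), using only arithmetic mod m_j."""
    v = []
    for j, mj in enumerate(moduli):
        t = residues[j] % mj
        for i in range(j):
            c_ij = pow(moduli[i], -1, mj)    # c_ij * m_i = 1 (modulo m_j), as in (22)
            t = (t - v[i]) * c_ij % mj
        v.append(t)
    return v


def from_mixed_radix(v, moduli):
    """Evaluate (24) by Horner's rule: v_1 + m_1 (v_2 + m_2 (v_3 + ...))."""
    u = 0
    for vj, mj in zip(reversed(v), reversed(moduli)):
        u = u * mj + vj
    return u


moduli = (7, 11, 13)
digits = garner_digits((2, 1, 10), moduli)
print(digits, from_mixed_radix(digits, moduli))   # [2, 3, 0] 23
```

Only the final Horner evaluation involves numbers larger than a single modulus; every step of garner_digits stays within the range of one m_j.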

The advantage of the computation shown in (23) is that the calculation
of v_j can be done using only arithmetic mod m_j, which is already built into the
modular arithmetic algorithms. Furthermore, (23) allows parallel computation:
We can start with (v_1, ..., v_r) ← (u_1 mod m_1, ..., u_r mod m_r), then at time j for
1 ≤ j < r we simultaneously set v_k ← (v_k - v_j) c_{jk} mod m_k for j < k ≤ r.
An alternative way to compute the mixed-radix representation, allowing similar
possibilities for parallelism, has been discussed by A. S. Fraenkel, Proc. ACM
Nat. Conf. 19 (Philadelphia, 1965), E1.4.

It is important to observe that the mixed-radix representation (24) is sufficient
to compare the magnitudes of two modular numbers. For if we know that
0 ≤ u < m and 0 ≤ u' < m, then we can tell if u < u' by first doing the
conversion to (v_1, ..., v_r) and (v'_1, ..., v'_r), then testing if v_r < v'_r, or if v_r = v'_r
and v_{r-1} < v'_{r-1}, etc. It is not necessary to convert all the way to binary
or decimal notation if we only want to know whether (u_1, ..., u_r) is less than
(u'_1, ..., u'_r).
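
A comparison along these lines might look as follows. The sketch computes the mixed-radix digits in the time-step order described for parallel computation (at step j, every later digit is updated independently), then compares digit vectors from the most significant end. The names mixed_radix and less_than are illustrative, and the inverses are obtained with Python’s pow(x, -1, m) rather than from a precomputed table.

```python
def mixed_radix(residues, moduli):
    """Mixed-radix digits, computed in the step-by-step order suited to parallelism."""
    v = [u % m for u, m in zip(residues, moduli)]
    r = len(moduli)
    for j in range(r - 1):
        # The updates for different k do not depend on one another.
        for k in range(j + 1, r):
            c_jk = pow(moduli[j], -1, moduli[k])
            v[k] = (v[k] - v[j]) * c_jk % moduli[k]
    return v


def less_than(us, ws, moduli):
    """Decide u < u' for numbers in 0 <= u, u' < m, given only their residues."""
    # Reversing puts v_r first, so list comparison is most-significant-first.
    return mixed_radix(us, moduli)[::-1] < mixed_radix(ws, moduli)[::-1]


moduli = (7, 11, 13)
residues = lambda u: tuple(u % m for m in moduli)
print(less_than(residues(23), residues(943), moduli))   # True
print(less_than(residues(943), residues(23), moduli))   # False
```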

The operation of comparing two numbers, or of deciding if a modular number 
is negative, is intuitively very simple, so we would expect to have a much easier 
method for making this test than the conversion to mixed-radix form. But the 
following theorem shows that there is little hope of finding a substantially better 
method, since the range of a modular number depends essentially on all bits of 
all the residues (u_1, ..., u_r):

Theorem S (Nicholas Szabo, 1961). In terms of the notation above, assume that
m_1 < √m, and let L be any value in the range

    m_1 ≤ L ≤ m - m_1.    (26)

Let g be any function such that the set {g(0), g(1), ..., g(m_1 - 1)} contains fewer
than m_1 values. Then there are numbers u and v such that

    g(u mod m_1) = g(v mod m_1),  u mod m_j = v mod m_j for 2 ≤ j ≤ r;    (27)

    0 ≤ u < L ≤ v < m.    (28)

Proof. By hypothesis, there must exist numbers u ≠ v satisfying (27), since g
must take on the same value for two different residues. Let (u, v) be a pair of
values with 0 ≤ u < v < m satisfying (27), for which u is a minimum. Since
u' = u - m_1 and v' = v - m_1 also satisfy (27), we must have u' < 0 by
the minimality of u. Hence u < m_1 ≤ L; and if (28) does not hold, we must
have v < L. But v > u, and v - u is a multiple of m_2 ... m_r = m/m_1, so
v ≥ v - u ≥ m/m_1 > m_1. Therefore, if (28) does not hold for (u, v), it will be
satisfied for the pair (u'', v'') = (v - m_1, u + m - m_1). |

Of course, a similar result can be proved for any m_j in place of m_1; and we
could also replace (28) by the condition “a ≤ u < a + L ≤ v < a + m” with
only minor changes in the proof. Therefore Theorem S shows that many simple
functions cannot be used to determine the range of a modular number.

Let us now reiterate the main points of the discussion in this section: Modular
arithmetic can be a significant advantage for applications in which the
predominant calculations involve exact multiplication (or raising to a power)
of large integers, combined with addition and subtraction, but where there is
very little need to divide or compare numbers, or to test whether intermediate
results “overflow” out of range. (It is important not to forget the latter restriction;
methods are available to test for overflow, as in exercise 12, but they are
in general so complicated that they nullify the advantages of modular arithmetic.)
Several applications of modular computations have been discussed by H.
Takahasi and Y. Ishibashi, Information Proc. in Japan 1 (1961), 28-42.

An example of such an application is the exact solution of linear equations
with rational coefficients. For various reasons it is desirable in this case to
assume that the moduli m_1, m_2, ..., m_r are all large prime numbers; the linear
equations can be solved independently modulo each m_j. A detailed discussion of
this procedure has been given by I. Borosh and A. S. Fraenkel [Math. Comp. 20 
(1966), 107-112]. By means of their method, the nine independent solutions of 
a system of 111 linear equations in 120 unknowns were obtained exactly in less 
than one hour’s running time on a CDC 1604 computer. The same procedure 
is worthwhile also for solving simultaneous linear equations with floating point 
coefficients, when the matrix of coefficients is ill-conditioned. The modular 
technique (treating the given floating point coefficients as exact rational numbers) 
gives a method for obtaining the true answers in less time than conventional 
methods can produce reliable approximate answers! [See M. T. McClellan, JACM 
20 (1973), 563-588, for further developments of this approach; and see also 
E. H. Bareiss, J. Inst. Math. and Appl. 10 (1972), 68-104, for a discussion of its
limitations.] 

The published literature concerning modular arithmetic is mostly oriented
towards hardware design, since the carry-free properties of modular arithmetic
make it attractive from the standpoint of high-speed operation. The idea was first
published by A. Svoboda and M. Valach in the Czechoslovakian journal Stroje
na Zpracování Informací 3 (1955), 247-295; then independently by H. L. Garner
[IRE Trans. EC-8 (1959), 140-147]. The use of moduli of the form 2^{e_j} - 1 was
suggested by A. S. Fraenkel [JACM 8 (1961), 87-96], and several advantages of
such moduli were demonstrated by A. Schönhage [Computing 1 (1966), 182-196].
See the book Residue Arithmetic and its Applications to Computer Technology
by N. S. Szabo and R. I. Tanaka (New York: McGraw-Hill, 1967), for additional
information and a comprehensive bibliography of the subject. A Russian book
published in 1968 by I. Ia. Akushskii and D. I. Iuditskii includes a chapter about
complex moduli [cf. Rev. Roumaine des Math. 15 (1970), 159-160].

Further discussion of modular arithmetic can be found in Section 4.3.3B. 


EXERCISES 

1. [20] Find all integers u that satisfy all of the following conditions: u mod 7 = 1,
u mod 11 = 6, u mod 13 = 5, 0 ≤ u < 1000.

2. [M20] Would Theorem C still be true if we allowed a, u_1, u_2, ..., u_r and u to be
arbitrary real numbers (not just integers)?

► 3. [M26] (Generalized Chinese Remainder Theorem.) Let m_1, m_2, ..., m_r be
positive integers. Let m be the least common multiple of m_1, m_2, ..., m_r, and let a,
u_1, u_2, ..., u_r be any integers. Prove that there is exactly one integer u that satisfies
the conditions

    a ≤ u < a + m,  u ≡ u_j (modulo m_j),  1 ≤ j ≤ r,

provided that

    u_i ≡ u_j (modulo gcd(m_i, m_j)),  1 ≤ i < j ≤ r;

and there is no such integer u when the latter condition fails to hold.

4. [20] Continue the process shown in (13); what would m_7, m_8, m_9, ... be?

► 5. [M23] Suppose that the method of (13) is continued until no more m_j can be
chosen; does this method give the largest attainable value m_1 m_2 ... m_r such that the
m_j are odd positive integers less than 100 that are relatively prime in pairs?

6. [M22] Let e, f, g be nonnegative integers. (a) Show that 2^e ≡ 2^f (modulo 2^g - 1)
if and only if e ≡ f (modulo g). (b) Given that e mod f = d and ce mod f = 1, prove
that

    ((1 + 2^d + ... + 2^{(c-1)d}) · (2^e - 1)) mod (2^f - 1) = 1.

(Thus, we have a comparatively simple formula for the inverse of 2^e - 1, modulo 2^f - 1,
as required in (22).)

► 7. [M21] Show that (23) can be rewritten as follows:

    v_1 ← u_1 mod m_1,
    v_2 ← (u_2 - v_1) c_{12} mod m_2,
    v_3 ← (u_3 - (v_1 + m_1 v_2)) c_{13} c_{23} mod m_3,
    ...
    v_r ← (u_r - (v_1 + m_1(v_2 + m_2(v_3 + ... + m_{r-2} v_{r-1}) ...))) c_{1r} ... c_{(r-1)r} mod m_r.

If the formulas are rewritten in this way, we see that only r - 1 constants C_j =
c_{1j} ... c_{(j-1)j} mod m_j are needed instead of r(r - 1)/2 constants c_{ij} as in (23). Discuss
the relative merits of this version of the formula as compared to (23), from the standpoint
of computer calculation.

8. [M21] Prove that the number u defined by (23) and (24) satisfies (25). 

9. [M20] Show how to go from the values v_1, ..., v_r of the mixed-radix notation
(24) back to the original residues u_1, ..., u_r, using only arithmetic mod m_j to compute
the value of u_j.

10. [M25] An integer u that lies in the symmetrical range (10) might be represented
by finding the numbers u_1, ..., u_r such that u ≡ u_j (modulo m_j) and -m_j/2 <
u_j < m_j/2, instead of insisting that 0 ≤ u_j < m_j as in the text. Discuss the modular
arithmetic procedures that would be appropriate in connection with such a symmetrical
representation (including the conversion process, (23)).

11. [M23] Assume that all the m_j are odd, and that u = (u_1, ..., u_r) is known to
be even, where 0 ≤ u < m. Find a reasonably fast method to compute u/2 using
modular arithmetic.

12. [M10] Prove that, if 0 ≤ u, v < m, the modular addition of u and v causes
overflow (i.e., is outside the range allowed by the modular representation) if and only 
if the sum is less than u. (Thus the overflow detection problem is equivalent to the 
comparison problem.) 

► 13. [M25] (Automorphic numbers.) An n-place decimal number x > 1 is called an
“automorph” by recreational mathematicians if the last n digits of x^2 are equal to x.
For example, 9376 is a 4-place automorph, since 9376^2 = 87909376. [See Scientific
American 218 (January 1968), 125.]

a) Prove that an n-place number x > 1 is an automorph if and only if x mod 5^n = 0
or 1 and x mod 2^n = 1 or 0, respectively. (Thus, if m_1 = 2^n and m_2 = 5^n, the
only two n-place automorphs are the numbers M_1 and M_2 in (7).)

b) Prove that if x is an n-place automorph, then (3x^2 - 2x^3) mod 10^{2n} is a 2n-place
automorph.

c) Given that cx ≡ 1 (modulo y), find a simple formula for a number c' depending
on c and x but not on y, such that c'x^2 ≡ 1 (modulo y^2).

*4.3.3. How Fast Can We Multiply? 

The conventional method for multiplication in positional number systems,
Algorithm 4.3.1M, requires approximately cmn operations to multiply an m-digit
number by an n-digit number, where c is a constant. In this section, let us assume
for convenience that m = n, and let us consider the following question: Does
every general computer algorithm for multiplying two n-digit numbers require
an execution time proportional to n^2, as n increases?

(In this question, a “general” algorithm means one that accepts, as input, the 
number n and two arbitrary n-digit numbers in positional notation, and whose 
output is their product in positional form. Certainly if we were allowed to choose 
a different algorithm for each value of n, the question would be of no interest, 
since multiplication could be done for any specific value of n by a “table-lookup” 
operation in some huge table. The term “computer algorithm” is meant to imply 
an algorithm that is suitable for implementation on a digital computer such as 
MIX, and the execution time is to be the time it takes to perform the algorithm 
on such a computer.) 

A. Digital methods. The answer to the above question is, rather surprisingly, 
“No,” and, in fact, it is not very difficult to see why. For convenience, let us 
assume throughout this section that we are working with integers expressed in 
binary notation. If we have two 2n-bit numbers u = (u_{2n-1} ... u_1 u_0)_2 and v =
(v_{2n-1} ... v_1 v_0)_2, we can write

    u = 2^n U_1 + U_0,  v = 2^n V_1 + V_0,    (1)

where U_1 = (u_{2n-1} ... u_n)_2 is the “most significant half” of the number u and
U_0 = (u_{n-1} ... u_0)_2 is the “least significant half”; similarly V_1 = (v_{2n-1} ... v_n)_2
and V_0 = (v_{n-1} ... v_0)_2. Now we have

    uv = (2^{2n} + 2^n) U_1 V_1 + 2^n (U_1 - U_0)(V_0 - V_1) + (2^n + 1) U_0 V_0.    (2)

This formula reduces the problem of multiplying 2n-bit numbers to three multiplications
of n-bit numbers, namely U_1 V_1, (U_1 - U_0)(V_0 - V_1), and U_0 V_0, plus
some simple shifting and adding operations.

Formula (2) can be used for double-precision multiplication when we want a
quadruple-precision result, and it will be just a little faster than the traditional
method on many machines. But the main advantage of (2) is that we can use
it to define a recursive process for multiplication that is significantly faster than
the familiar order-$n^2$ method when n is large: If T(n) is the time required to
perform multiplication of n-bit numbers, we have

$$T(2n) \le 3T(n) + cn \tag{3}$$

for some constant c, since the right-hand side of (2) uses just three multiplications
plus some additions and shifts. Relation (3) implies by induction that

$$T(2^k) \le c\,(3^k - 2^k), \qquad k \ge 1, \tag{4}$$

if we choose c to be large enough so that this inequality is valid when k = 1;
therefore we have

$$T(n) \le T(2^{\lceil \lg n \rceil}) \le c\,(3^{\lceil \lg n \rceil} - 2^{\lceil \lg n \rceil}) < 3c \cdot 3^{\lg n} = 3c\,n^{\lg 3}. \tag{5}$$

Relation (5) shows that the running time for multiplication can be reduced from
order $n^2$ to order $n^{\lg 3} \approx n^{1.585}$, so the recursive method is much faster than
the traditional method when n is large.
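
The recursion implied by (2) and (3) can be sketched in Python. This is only an illustration under stated assumptions: the function name `multiply_by_halves` and the parameter `cutoff_bits` are hypothetical, Python's unbounded integers stand in for n-bit registers, and operands no longer than the cutoff are handed to the built-in multiplication.

```python
def multiply_by_halves(u: int, v: int, cutoff_bits: int = 32) -> int:
    """Multiply nonnegative integers by applying formula (2) recursively."""
    if u < 0 or v < 0:
        raise ValueError("operands must be nonnegative")
    size = max(u.bit_length(), v.bit_length())
    if size <= max(cutoff_bits, 1):
        return u * v                      # small case: built-in multiplication
    n = (size + 1) // 2                   # split 2n-bit operands into n-bit halves
    mask = (1 << n) - 1
    U1, U0 = u >> n, u & mask
    V1, V0 = v >> n, v & mask
    hi = multiply_by_halves(U1, V1, cutoff_bits)          # U1 * V1
    lo = multiply_by_halves(U0, V0, cutoff_bits)          # U0 * V0
    a, b = U1 - U0, V0 - V1               # either factor may be negative
    sign = -1 if (a < 0) != (b < 0) else 1
    mid = sign * multiply_by_halves(abs(a), abs(b), cutoff_bits)
    # (2^{2n} + 2^n) U1V1 + 2^n (U1 - U0)(V0 - V1) + (2^n + 1) U0V0
    return (hi << (2 * n)) + (hi << n) + (mid << n) + (lo << n) + lo
```

Only three recursive calls are made at each level, which is what yields the bound (5); the remaining work is shifting and adding.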

(A similar but more complicated method for doing multiplication with running
time of order $n^{\lg 3}$ was apparently first suggested by A. Karatsuba in
Doklady Akad. Nauk SSSR 145 (1962), 293-294 [English translation in Soviet
Physics-Doklady 7 (1963), 595-596]. Curiously, this idea does not seem to have
been discovered before 1962; none of the “calculating prodigies” who have become
famous for their ability to multiply large numbers mentally have been
reported to use any such method, although formula (2) adapted to decimal notation
would seem to lead to a reasonably easy way to multiply eight-digit numbers
in one’s head.)

The running time can be reduced still further, in the limit as n approaches 
infinity, if we observe that the method just used is essentially the special case 
r = 1 of a more general method that yields 

$$T((r+1)n) \le (2r+1)\,T(n) + cn \tag{6}$$

for any fixed r. This more general method can be obtained as follows: Let
$u = (u_{(r+1)n-1} \ldots u_1 u_0)_2$ and $v = (v_{(r+1)n-1} \ldots v_1 v_0)_2$
be broken into r + 1 pieces,

$$u = U_r 2^{rn} + \cdots + U_1 2^n + U_0, \qquad v = V_r 2^{rn} + \cdots + V_1 2^n + V_0, \tag{7}$$

where each $U_j$ and each $V_j$ is an n-bit number. Consider the polynomials

$$U(x) = U_r x^r + \cdots + U_1 x + U_0, \qquad V(x) = V_r x^r + \cdots + V_1 x + V_0, \tag{8}$$

and let 

$$W(x) = U(x)V(x) = W_{2r} x^{2r} + \cdots + W_1 x + W_0. \tag{9}$$

Since $u = U(2^n)$ and $v = V(2^n)$, we have $uv = W(2^n)$, so we can easily
compute uv if we know the coefficients of W(x). The problem is to find a good
way to compute the coefficients of W(x) by using only 2r + 1 multiplications of
n-bit numbers plus some further operations that involve only an execution time
proportional to n. This can be done by computing

$$U(0)V(0) = W(0), \quad U(1)V(1) = W(1), \quad \ldots, \quad U(2r)V(2r) = W(2r). \tag{10}$$

The coefficients of a polynomial of degree 2r can be written as a linear combination
of the values of that polynomial at 2r + 1 distinct points; computing
such a linear combination requires an execution time at most proportional to n.
(Actually, the products U(j)V(j) are not strictly products of n-bit numbers, but
they are products of at most (n + t)-bit numbers, where t is a fixed value depending
on r. It is easy to design a multiplication routine for (n + t)-bit numbers
that requires only $T(n) + c_1 n$ operations, where T(n) is the number of operations
needed for n-bit multiplications, since two products of t-bit by n-bit numbers
can be done in $c_2 n$ operations when t is fixed.) Therefore we obtain a method
of multiplication satisfying (6).

Relation (6) implies that $T(n) \le c_3\,n^{\log_{r+1}(2r+1)} < c_3\,n^{1 + \log_{r+1} 2}$, if we
argue as in the derivation of (5), so we have now proved the following result:

Theorem A. Given $\epsilon > 0$, there exists a multiplication algorithm such that the
number of elementary operations T(n) needed to multiply two n-bit numbers
satisfies

$$T(n) < c(\epsilon)\,n^{1+\epsilon}, \tag{11}$$

for some constant $c(\epsilon)$ independent of n. ∎
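
To see how slowly the exponent $\log_{r+1}(2r+1)$ from relation (6) approaches 1, one can tabulate it for a few values of r. The following Python fragment is a small numerical illustration only; the helper name `toom_exponent` is hypothetical.

```python
import math

def toom_exponent(r: int) -> float:
    """Exponent log_{r+1}(2r+1) obtained when operands are split into r+1 pieces."""
    return math.log(2 * r + 1) / math.log(r + 1)

for r in (1, 2, 3, 4, 8, 16, 64, 1024):
    print(f"r = {r:5d}   exponent = {toom_exponent(r):.4f}")
```

The case r = 1 gives $\lg 3$, the exponent in (5); larger r gives smaller exponents, at the price of a rapidly growing constant.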

This theorem is still not the result we are after. It is unsatisfactory for
practical purposes in that the method becomes much more complicated as $\epsilon \to 0$
(and therefore as $r \to \infty$), causing $c(\epsilon)$ to grow so rapidly that extremely huge
values of n are needed before we have any significant improvement over (5). And
it is unsatisfactory for theoretical purposes because it does not make use of the 
full power of the polynomial method on which it is based. We can obtain a 
better result if we let r vary with n, choosing larger and larger values of r as 
n increases. This idea is due to A. L. Toom [Doklady Akad. Nauk SSSR 150 
(1963), 496-498, English translation in Soviet Mathematics 3 (1963), 714-716], 
who used it to show that computer circuitry for multiplication of n-bit numbers 
can be constructed involving a fairly small number of components as n grows. 
S. A. Cook [On the minimum computation time of functions (Thesis, Harvard 
University, 1966), 51-77] later showed how Toom’s method can be adapted to
fast computer programs. 

Before we discuss the Toom-Cook algorithm any further, let us study a small 
example of the transition from U(x) and V(x) to the coefficients of W(x). This 
example will not demonstrate the efficiency of the method, since the numbers 
are too small, but it points out some useful simplifications that we can make in 
the general case. Suppose that we want to multiply u = 1234 times v = 2341;
in binary notation this is $u = (0100\;1101\;0010)_2$ times $v = (1001\;0010\;0101)_2$.
Let r = 2; the polynomials U(x), V(x) in (8) are

$$U(x) = 4x^2 + 13x + 2, \qquad V(x) = 9x^2 + 2x + 5.$$


Hence we find, for W(x) = U(x)V(x), 


$$\begin{array}{lllll}
U(0) = 2, & U(1) = 19, & U(2) = 44, & U(3) = 77, & U(4) = 118;\\
V(0) = 5, & V(1) = 16, & V(2) = 45, & V(3) = 92, & V(4) = 157;\\
W(0) = 10, & W(1) = 304, & W(2) = 1980, & W(3) = 7084, & W(4) = 18526.
\end{array} \tag{12}$$


Our job now is to compute the five coefficients of W(x) from the latter five values. 
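
The values in (12) can be reproduced with a few lines of Python. This is a sketch for the worked example only; the helper names `split_digits` and `evaluate` are hypothetical, and the piece size of 4 bits is the one used in the example.

```python
def split_digits(u: int, bits: int, pieces: int) -> list:
    """Break u into `pieces` groups of `bits` bits, least significant first."""
    mask = (1 << bits) - 1
    return [(u >> (bits * i)) & mask for i in range(pieces)]

def evaluate(coeffs: list, x: int) -> int:
    """Horner's rule; coeffs[i] is the coefficient of x**i."""
    acc = 0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

U = split_digits(1234, 4, 3)      # coefficients U_0, U_1, U_2
V = split_digits(2341, 4, 3)      # coefficients V_0, V_1, V_2
U_vals = [evaluate(U, j) for j in range(5)]
V_vals = [evaluate(V, j) for j in range(5)]
W_vals = [a * b for a, b in zip(U_vals, V_vals)]
print(U_vals, V_vals, W_vals)
```

Only the five products in the last line are genuine multiplications of long numbers; the evaluations multiply by the small constants 0 through 4.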

There is an attractive little algorithm that can be used to compute the
coefficients of a polynomial $W(x) = W_{m-1} x^{m-1} + \cdots + W_1 x + W_0$ when the
values W(0), W(1), ..., W(m - 1) are given: Let us first write

$$W(x) = \theta_{m-1} x^{\underline{m-1}} + \theta_{m-2} x^{\underline{m-2}} + \cdots + \theta_1 x^{\underline{1}} + \theta_0, \tag{13}$$

where $x^{\underline{k}} = x(x-1)\ldots(x-k+1)$, and where the coefficients $\theta_j$ are unknown.
The falling factorial powers have the important property that

$$W(x+1) - W(x) = (m-1)\,\theta_{m-1} x^{\underline{m-2}} + (m-2)\,\theta_{m-2} x^{\underline{m-3}} + \cdots + \theta_1;$$

hence by induction we find that, for all $k \ge 0$,

$$\frac{1}{k!}\Bigl(W(x+k) - \binom{k}{1} W(x+k-1) + \binom{k}{2} W(x+k-2) - \cdots + (-1)^k W(x)\Bigr)
= \binom{m-1}{k}\theta_{m-1} x^{\underline{m-1-k}} + \binom{m-2}{k}\theta_{m-2} x^{\underline{m-2-k}} + \cdots + \binom{k}{k}\theta_k. \tag{14}$$

Denoting the left-hand side of (14) by $(1/k!)\,\Delta^k W(x)$, we see that

$$\frac{1}{k!}\Delta^k W(x) = \frac{1}{k}\Bigl(\frac{1}{(k-1)!}\Delta^{k-1} W(x+1) - \frac{1}{(k-1)!}\Delta^{k-1} W(x)\Bigr)$$

and $(1/k!)\,\Delta^k W(0) = \theta_k$. So the coefficients $\theta_j$ can be evaluated using a very
simple method, illustrated here for the polynomial W(x) in (12):


     10
            294
    304             1382/2 = 691
           1676                      1023/3 = 341
   1980             3428/2 = 1714                     144/4 = 36      (15)
           5104                      1455/3 = 485
   7084             6338/2 = 3169
          11442
  18526

The leftmost column of this tableau is a listing of the given values of W(0), W(1),
..., W(4); the kth succeeding column is obtained by computing the difference
between successive values of the preceding column and dividing by k. The
coefficients $\theta_j$ appear at the top of the columns, so that $\theta_0 = 10$, $\theta_1 = 294$,
..., $\theta_4 = 36$, and we have

$$\begin{aligned}
W(x) &= 36x^{\underline{4}} + 341x^{\underline{3}} + 691x^{\underline{2}} + 294x^{\underline{1}} + 10\\
&= (((36(x-3) + 341)(x-2) + 691)(x-1) + 294)x + 10.
\end{aligned} \tag{16}$$
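
The difference tableau can be carried out in place on a single list. The Python sketch below is illustrative only; the function name `falling_power_coefficients` is hypothetical, and it assumes the supplied values come from a polynomial with integer coefficients, so that every division is exact.

```python
def falling_power_coefficients(values: list) -> list:
    """Given W(0), ..., W(m-1), return [theta_0, ..., theta_{m-1}] as in (13)."""
    theta = list(values)
    m = len(theta)
    for j in range(1, m):                  # j-th column of the tableau
        for t in range(m - 1, j - 1, -1):  # work from the bottom upward
            diff = theta[t] - theta[t - 1]
            if diff % j != 0:
                raise ValueError("values do not come from an integer polynomial")
            theta[t] = diff // j
    return theta

print(falling_power_coefficients([10, 304, 1980, 7084, 18526]))
```

Processing t in decreasing order lets each column overwrite the previous one, because entry t needs only the old entries t and t - 1.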

In general, we can write

$$W(x) = (\ldots((\theta_{m-1}(x-m+2) + \theta_{m-2})(x-m+3) + \theta_{m-3})(x-m+4) + \cdots + \theta_1)\,x + \theta_0,$$

and this formula shows how the coefficients $W_{m-1}$, ..., $W_1$, $W_0$ can be obtained
from the $\theta$’s:

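   36    341
        -3·36
   ----------
   36    233    691
        -2·36  -2·233
   ------------------
   36    161    225    294                                            (17)
        -1·36  -1·161  -1·225
   --------------------------
   36    125     64     69     10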

Here the numbers below the horizontal lines successively show the coefficients of
the polynomials

$$\theta_{m-1}, \qquad \theta_{m-1}(x-m+2) + \theta_{m-2}, \qquad (\theta_{m-1}(x-m+2) + \theta_{m-2})(x-m+3) + \theta_{m-3}, \qquad \text{etc.}$$

From this tableau we have

$$W(x) = 36x^4 + 125x^3 + 64x^2 + 69x + 10,$$

so the answer to our original problem is 1234 · 2341 = W(16), where W(16) is
obtained by adding and shifting. A generalization of this method for obtaining
coefficients is discussed in Section 4.6.4.
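
The passage from the $\theta$’s to the ordinary coefficients, followed by the final shift-and-add evaluation at x = 16, can be sketched in Python. The function name `falling_to_power_basis` is hypothetical, and the input list is the one obtained for the worked example.

```python
def falling_to_power_basis(theta: list) -> list:
    """Convert [theta_0, ..., theta_{m-1}] of (13) into [W_0, ..., W_{m-1}]."""
    W = list(theta)
    m = len(W)
    for j in range(m - 2, 0, -1):      # multiply by (x - j), j decreasing
        for t in range(j, m - 1):      # t increasing, so W[t+1] is still old
            W[t] -= j * W[t + 1]
    return W

coeffs = falling_to_power_basis([10, 294, 691, 341, 36])
# W(16): each coefficient is shifted by a multiple of 4 bits and added.
product = sum(c << (4 * i) for i, c in enumerate(coeffs))
print(coeffs, product)
```

The shift distance 4 is the piece size of the example; carries between overlapping coefficients are absorbed by the additions.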

The basic Stirling number identity,

$$x^n = \genfrac{\{}{\}}{0pt}{}{n}{n}\,x^{\underline{n}} + \cdots + \genfrac{\{}{\}}{0pt}{}{n}{1}\,x^{\underline{1}} + \genfrac{\{}{\}}{0pt}{}{n}{0},$$

Eq. 1.2.6-41, shows that if the coefficients of W(x) are nonnegative, so are the
numbers $\theta_j$, and in such a case all of the intermediate results in the above computation
are nonnegative. This further simplifies the Toom-Cook multiplication
algorithm, which we will now consider in detail.

Algorithm C (High-precision multiplication of binary numbers). Given a positive
integer n and two nonnegative n-bit integers u and v, this algorithm forms their
2n-bit product, w. Four auxiliary stacks are used to hold the long numbers that
are manipulated during the procedure:

Stacks U, V: Temporary storage of U(j) and V(j) in step C4.

Stack C: Numbers to be multiplied, and control codes.

Stack W: Storage of W(j).

These stacks may contain either binary numbers or special control symbols called
code-1, code-2, and code-3. The algorithm also constructs an auxiliary table of
numbers $q_k$, $r_k$; this table is maintained in such a manner that it may be stored
as a linear list, where a single pointer that traverses the list (moving back and
forth) may be used to access the current table entry of interest.

(Stacks C and W are used to control the recursive mechanism of this multiplication
algorithm in a reasonably straightforward manner that is a special case
of general procedures discussed in Chapter 8.)

C1. [Compute q, r tables.] Set stacks U, V, C, and W empty. Set

$$k \leftarrow 1, \quad q_0 \leftarrow q_1 \leftarrow 16, \quad r_0 \leftarrow r_1 \leftarrow 4, \quad Q \leftarrow 4, \quad R \leftarrow 2.$$

Now if $q_{k-1} + q_k < n$, set

$$k \leftarrow k+1, \quad Q \leftarrow Q+R, \quad R \leftarrow \lfloor\sqrt{Q}\rfloor, \quad q_k \leftarrow 2^Q, \quad r_k \leftarrow 2^R,$$

and repeat this operation until $q_{k-1} + q_k \ge n$. (Note: The calculation of
$R \leftarrow \lfloor\sqrt{Q}\rfloor$ does not require a square root to be taken, since we may simply
set $R \leftarrow R+1$ if $(R+1)^2 \le Q$ and leave R unchanged if $(R+1)^2 > Q$;
see exercise 2. In this step we build the sequences

    k   =  0     1     2     3     4      5      6     ...
    q_k =  2^4   2^4   2^6   2^8   2^10   2^13   2^16  ...
    r_k =  2^2   2^2   2^2   2^2   2^3    2^3    2^4   ...

The multiplication of 70000-bit numbers would cause this step to terminate
with k = 6, since $70000 < 2^{13} + 2^{16}$.)

C2. [Put u, v on stack.] Put code-1 on stack C, then place u and v onto stack C
as numbers of exactly $q_{k-1} + q_k$ bits each.

C3. [Check recursion level.] Decrease k by 1. If k = 0, the top of stack C
now contains two 32-bit numbers, u and v; remove them, set $w \leftarrow uv$ using
a built-in routine for multiplying 32-bit numbers, and go to step C10. If
k > 0, set $r \leftarrow r_k$, $q \leftarrow q_k$, $p \leftarrow q_{k-1} + q_k$, and go on to step C4.

[Fig. 8. Toom-Cook algorithm for high-precision multiplication.]

C4. [Break into r + 1 parts.] Let the number at the top of stack C be regarded
as a list of r + 1 numbers with q bits each, $(U_r \ldots U_1 U_0)_{2^q}$. (The top of
stack C now contains an $(r+1)q = (q_k + q_{k+1})$-bit number.) For j = 0,
1, ..., 2r, compute the p-bit numbers

$$(\ldots(U_r j + U_{r-1})j + \cdots + U_1)j + U_0 = U(j)$$

and successively put these values onto stack U. (The bottom of stack U
now contains U(0), then comes U(1), etc., with U(2r) on top. Note that

$$U(j) \le U(2r) < 2^q\bigl((2r)^r + (2r)^{r-1} + \cdots + 1\bigr) < 2^{q+1}(2r)^r \le 2^p,$$

by exercise 3.) Then remove $U_r \ldots U_1 U_0$ from stack C.

Now the top of stack C contains another list of r + 1 q-bit numbers,
$V_r \ldots V_1 V_0$, and the p-bit numbers

$$(\ldots(V_r j + V_{r-1})j + \cdots + V_1)j + V_0 = V(j)$$

should be put onto stack V in the same way. After this has been done,
remove $V_r \ldots V_1 V_0$ from stack C.

C5. [Recurse.] Successively put the following items onto stack C, at the same
time emptying stacks U and V:

    code-2, V(2r), U(2r), code-3, V(2r - 1), U(2r - 1), ...,
    code-3, V(1), U(1), code-3, V(0), U(0).

Go back to step C3.

C6. [Save one product.] (At this point the multiplication algorithm has set w to
one of the products W(j) = U(j)V(j).) Put w onto stack W. (This number
w contains $2(q_k + q_{k-1})$ bits.) Go back to step C3.

C7. [Find $\theta$’s.] Set $r \leftarrow r_k$, $q \leftarrow q_k$, $p \leftarrow q_{k-1} + q_k$. (At this point stack W
contains a sequence of numbers ending with W(0), W(1), ..., W(2r) from
bottom to top, where each W(j) is a 2p-bit number.)

Now for j = 1, 2, 3, ..., 2r, perform the following loop: For t = 2r,
2r - 1, 2r - 2, ..., j, set $W(t) \leftarrow (W(t) - W(t-1))/j$. (Here j must
increase and t must decrease. The quantity $(W(t) - W(t-1))/j$ will always
be a nonnegative integer that fits in 2p bits; cf. (15).)

C8. [Find W’s.] For j = 2r - 1, 2r - 2, ..., 1, perform the following loop: For
t = j, j + 1, ..., 2r - 1, set $W(t) \leftarrow W(t) - jW(t+1)$. (Here j must
decrease and t must increase. The result of this operation will again be a
nonnegative 2p-bit integer; cf. (17).)

C9. [Set answer.] Set w to the $2(q_k + q_{k+1})$-bit integer

$$(\ldots(W(2r)2^q + W(2r-1))2^q + \cdots + W(1))2^q + W(0).$$

Remove W(2r), ..., W(0) from stack W.

C10. [Return.] Set $k \leftarrow k+1$. Remove the top of stack C. If it is code-3, go
to step C6. If it is code-2, put w onto stack W and go to step C7. And if
it is code-1, terminate the algorithm (w is the answer). ∎
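
For readers who want to experiment, the arithmetic of Algorithm C can be sketched in Python. This is not the stack-based procedure itself: ordinary recursion replaces stacks C and W and the control codes, Python's unbounded integers replace fixed-width storage, and all function names (`build_tables`, `horner`, `multiply_level`, `toom_cook_multiply`) are hypothetical. The comments indicate which step each part corresponds to.

```python
def build_tables(n: int):
    """Step C1: return (k, q, r) with q[k-1] + q[k] >= n."""
    q, r = [16, 16], [4, 4]
    Q, R, k = 4, 2, 1
    while q[k - 1] + q[k] < n:
        k += 1
        Q += R
        if (R + 1) ** 2 <= Q:       # R <- floor(sqrt(Q)) without a square root
            R += 1
        q.append(1 << Q)
        r.append(1 << R)
    return k, q, r

def horner(pieces: list, j: int) -> int:
    """Evaluate U_r j^r + ... + U_1 j + U_0; pieces[i] holds U_i."""
    acc = 0
    for piece in reversed(pieces):
        acc = acc * j + piece
    return acc

def multiply_level(u: int, v: int, k: int, q: list, r: list) -> int:
    """Multiply two numbers of at most q[k] + q[k+1] bits."""
    if k == 0:
        return u * v                # stands in for the built-in 32-bit multiply
    rk, qk = r[k], q[k]
    mask = (1 << qk) - 1
    U = [(u >> (qk * i)) & mask for i in range(rk + 1)]        # step C4
    V = [(v >> (qk * i)) & mask for i in range(rk + 1)]
    W = [multiply_level(horner(U, j), horner(V, j), k - 1, q, r)
         for j in range(2 * rk + 1)]                           # steps C5, C6
    for j in range(1, 2 * rk + 1):                             # step C7
        for t in range(2 * rk, j - 1, -1):
            W[t] = (W[t] - W[t - 1]) // j
    for j in range(2 * rk - 1, 0, -1):                         # step C8
        for t in range(j, 2 * rk):
            W[t] -= j * W[t + 1]
    w = 0
    for t in range(2 * rk, -1, -1):                            # step C9
        w = (w << qk) + W[t]
    return w

def toom_cook_multiply(u: int, v: int) -> int:
    n = max(u.bit_length(), v.bit_length(), 1)
    k, q, r = build_tables(n)
    return multiply_level(u, v, k - 1, q, r)   # step C3 decreases k first
```

For example, `build_tables(70000)` stops with k = 6, in agreement with the note in step C1.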

Let us now estimate the running time, T(n), for Algorithm C, in terms of
some things we shall call “cycles,” i.e., elementary machine operations. Step C1
takes $O(q_k)$ cycles, even if we represent the number $q_k$ internally as a long string
of $q_k$ bits followed by some delimiter, since $q_k + q_{k-1} + \cdots + q_0$ will be $O(q_k)$.
Step C2 obviously takes $O(q_k)$ cycles.

Now let $t_k$ denote the amount of computation required to get from step C3 to
step C10 for a particular value of k (after k has been decreased at the beginning of
step C3). Step C3 requires O(q) cycles at most. Step C4 involves r multiplications
of p-bit numbers by (lg 2r)-bit numbers, and r additions of p-bit numbers, all
repeated 4r + 2 times. Thus we need a total of $O(r^2 q \log r)$ cycles. Step C5
requires moving 4r + 2 p-bit numbers, so it involves O(rq) cycles. Step C6
requires O(q) cycles, and it is done 2r + 1 times per iteration. The recursion
involved when the algorithm essentially invokes itself (by returning to step C3)
requires $t_{k-1}$ cycles, 2r + 1 times. Step C7 requires $O(r^2)$ subtractions of p-bit
numbers and divisions of 2p-bit by (lg 2r)-bit numbers, so it requires $O(r^2 q \log r)$
cycles. Similarly, step C8 requires $O(r^2 q \log r)$ cycles. Step C9 involves O(rq)
cycles, and C10 takes hardly any time at all.

Summing up, we have $T(n) = O(q_k) + O(q_k) + t_{k-1}$, where (if $q = q_k$ and
$r = r_k$) the main contribution to the running time satisfies

$$\begin{aligned}
t_k &= O(q) + O(r^2 q \log r) + O(rq) + (2r+1)\,O(q) + O(r^2 q \log r)\\
&\qquad + O(r^2 q \log r) + O(rq) + O(q) + (2r+1)\,t_{k-1}\\
&= O(r^2 q \log r) + (2r+1)\,t_{k-1}.
\end{aligned}$$

Thus there is a constant c such that

$$t_k \le c\,r_k^2\,q_k \lg r_k + (2r_k + 1)\,t_{k-1}.$$

To complete the estimation of $t_k$ we can prove by brute force that

$$t_k \le C\,q_{k+1}\,2^{2.5\sqrt{\lg q_{k+1}}} \tag{18}$$

for some constant C. Let us choose C > 20c, and let us also take C large enough
so that (18) is valid for $k \le k_0$, where $k_0$ will be specified below. Then when
$k > k_0$, let $Q_k = \lg q_k$, $R_k = \lg r_k$; we have by induction

$$t_k \le c\,q_k r_k^2 \lg r_k + (2r_k + 1)\,C\,q_k\,2^{2.5\sqrt{Q_k}} = C\,q_{k+1}\,2^{2.5\sqrt{\lg q_{k+1}}}\,(\eta_1 + \eta_2),$$

where

$$\eta_1 = \frac{c}{C}\,R_k\,2^{R_k - 2.5\sqrt{Q_{k+1}}} < \frac{1}{20}\,R_k\,2^{-R_k} < 0.05,$$

$$\eta_2 = \Bigl(2 + \frac{1}{r_k}\Bigr)\,2^{2.5(\sqrt{Q_k} - \sqrt{Q_{k+1}})} \to 2^{-1/4} < 0.85,$$

since

$$\sqrt{Q_{k+1}} - \sqrt{Q_k} = \sqrt{Q_k + \lfloor\sqrt{Q_k}\rfloor} - \sqrt{Q_k} \to \tfrac12$$

as $k \to \infty$. It follows that we can find $k_0$ such that $\eta_2 < 0.95$ for all $k > k_0$,
and this completes the proof of (18) by induction.

Finally, therefore, we may compute T(n). Since $n > q_{k-1} + q_{k-2}$, we have
$q_{k-1} < n$; hence

$$r_{k-1} = 2^{\lfloor\sqrt{\lg q_{k-1}}\rfloor} < 2^{\sqrt{\lg n}}, \qquad \text{and} \qquad q_k = r_{k-1}\,q_{k-1} < n\,2^{\sqrt{\lg n}}.$$

Thus

$$t_{k-1} \le C\,q_k\,2^{2.5\sqrt{\lg q_k}} < C\,n\,2^{\sqrt{\lg n} + 2.5(\sqrt{\lg n} + 1)},$$

and, since $T(n) = O(q_k) + t_{k-1}$, we have finally derived the following theorem:

Theorem C. There is a constant $c_0$ such that the execution time of Algorithm C
is less than $c_0\,n\,2^{3.5\sqrt{\lg n}}$ cycles. ∎

Since $n\,2^{3.5\sqrt{\lg n}} = n^{1 + 3.5/\sqrt{\lg n}}$, this result is noticeably stronger than
Theorem A. By adding a few complications to the algorithm, pushing the ideas
to their apparent limits (see exercise 5), we can improve the estimated execution
time to

$$T(n) = O\bigl(n\,2^{\sqrt{2\lg n}} \log n\bigr). \tag{19}$$





B. A modular method. There is another way to multiply large numbers very 
rapidly, based on the ideas of modular arithmetic as presented in Section 4.3.2. 
It is very hard to believe at first that this method can be of advantage, since a 
multiplication algorithm based on modular arithmetic must include the choice of 
moduli and the conversion of numbers into and out of modular representation, 
besides the actual multiplication operation itself. In spite of these formidable 
difficulties, A. Schönhage discovered that all of these operations can be carried
out quite rapidly.

In order to understand the essential mechanism of Schönhage’s method, we
shall look at a special case. Consider the sequence defined by the rules

$$q_0 = 1, \qquad q_{k+1} = 3q_k - 1, \tag{20}$$

so that $q_k = 3^k - 3^{k-1} - \cdots - 1 = \frac12(3^k + 1)$. We will study a procedure
that multiplies $(18q_k + 8)$-bit numbers, in terms of a method for multiplying
$(18q_{k-1} + 8)$-bit numbers. Thus, if we know how to multiply numbers having
$(18q_0 + 8) = 26$ bits, the procedure to be described will show us how to multiply
numbers of $(18q_1 + 8) = 44$ bits, then 98 bits, then 260 bits, etc., eventually
increasing the number of bits by almost a factor of 3 at each step.

Let $p_k = 18q_k + 8$. When multiplying $p_k$-bit numbers, the idea is to use
the six moduli

$$\begin{aligned}
m_1 &= 2^{6q_k - 1} - 1, & m_2 &= 2^{6q_k + 1} - 1, & m_3 &= 2^{6q_k + 2} - 1,\\
m_4 &= 2^{6q_k + 3} - 1, & m_5 &= 2^{6q_k + 5} - 1, & m_6 &= 2^{6q_k + 7} - 1.
\end{aligned} \tag{21}$$

These moduli are relatively prime, by Eq. 4.3.2-18, since the exponents

$$6q_k - 1, \quad 6q_k + 1, \quad 6q_k + 2, \quad 6q_k + 3, \quad 6q_k + 5, \quad 6q_k + 7 \tag{22}$$

are always relatively prime (see exercise 6). The six moduli in (21) are capable
of representing numbers up to $m = m_1 m_2 m_3 m_4 m_5 m_6 > 2^{36q_k + 16} = 2^{2p_k}$, so
there is no chance of overflow in the multiplication of $p_k$-bit numbers u and v.
Thus we may use the following method, when k > 0:

a) Compute $u_1 = u \bmod m_1$, ..., $u_6 = u \bmod m_6$, and $v_1 = v \bmod m_1$, ...,
$v_6 = v \bmod m_6$.

b) Multiply $u_1$ by $v_1$, $u_2$ by $v_2$, ..., $u_6$ by $v_6$. These are numbers of at most
$6q_k + 7 = 18q_{k-1} + 1 < p_{k-1}$ bits, so the multiplications can be performed
by using the assumed $p_{k-1}$-bit multiplication procedure.

c) Compute $w_1 = u_1 v_1 \bmod m_1$, $w_2 = u_2 v_2 \bmod m_2$, ..., $w_6 = u_6 v_6 \bmod m_6$.

d) Compute w such that $0 \le w < m$, $w \bmod m_1 = w_1$, ..., $w \bmod m_6 = w_6$.
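
Steps (a) through (d) can be tried out directly in Python. The sketch below is only a check of the arithmetic, not an efficient implementation: the built-in `%` and `*` stand in for the fast reductions and the recursive $p_{k-1}$-bit multiplications, step (d) uses a textbook incremental Chinese-remainder reconstruction with `pow(M, -1, m)` (available in Python 3.8 and later), and the function names are hypothetical.

```python
def q_value(k: int) -> int:
    """q_0 = 1, q_{k+1} = 3 q_k - 1, as in (20)."""
    q = 1
    for _ in range(k):
        q = 3 * q - 1
    return q

def six_moduli(k: int) -> list:
    """The moduli (21) for level k."""
    q = q_value(k)
    return [(1 << (6 * q + d)) - 1 for d in (-1, 1, 2, 3, 5, 7)]

def modular_multiply(u: int, v: int, k: int) -> int:
    p = 18 * q_value(k) + 8
    if not (0 <= u < (1 << p) and 0 <= v < (1 << p)):
        raise ValueError("operands must be p_k-bit numbers")
    moduli = six_moduli(k)
    residues = [((u % m) * (v % m)) % m for m in moduli]   # steps (a)-(c)
    w, M = 0, 1                                            # step (d)
    for m_j, w_j in zip(moduli, residues):
        t = ((w_j - w) * pow(M, -1, m_j)) % m_j
        w += M * t           # now w agrees with every residue seen so far
        M *= m_j
    return w
```

Because the product of the six moduli exceeds $2^{2p_k}$, the reconstructed w is the true product and not merely a residue of it.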

Let $t_k$ be the amount of time needed for this process. It is not hard to see
that operation (a) takes $O(p_k)$ cycles, since the determination of $u \bmod (2^e - 1)$
is quite simple (like “casting out nines”), as shown in Section 4.3.2. Similarly,
operation (c) takes $O(p_k)$ cycles. Operation (b) requires essentially $6t_{k-1}$ cycles.
This leaves us with operation (d), which seems to be quite a difficult computation;
but Schönhage has found an ingenious way to perform step (d) in $O(p_k \log p_k)$
cycles, and this is the crux of the method. As a consequence, we have

$$t_k = 6t_{k-1} + O(p_k \log p_k).$$

Since $p_k = 3^{k+2} + 17$, we can show that the time for n-bit multiplication is

$$T(n) = O(n^{\log_3 6}) = O(n^{1.63}). \tag{23}$$

(See exercise 7.)
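
The “casting out nines” reduction used in operations (a) and (c) relies on the fact that $2^e \equiv 1$ (modulo $2^e - 1$), so e-bit groups of the number may simply be added together. A minimal Python sketch, with the hypothetical name `mod_mersenne`:

```python
def mod_mersenne(u: int, e: int) -> int:
    """Return u mod (2**e - 1) for u >= 0, e >= 1, using only shifts and adds."""
    m = (1 << e) - 1
    while u > m:
        u = (u & m) + (u >> e)   # fold the high part onto the low e bits
    return 0 if u == m else u    # the all-ones pattern also represents zero
```

Each pass through the loop strictly decreases u, and no division is ever performed.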

Although the modular method is more complicated than the $O(n^{\lg 3})$ procedure
discussed at the beginning of this section, Eq. (23) shows that it does,
in fact, lead to an execution time substantially better than $O(n^2)$ for the multiplication
of n-bit numbers. Thus we can improve on the classical method by
using either of two completely different approaches.

Let us now analyze operation (d) above. Assume that we are given a set of
positive integers $e_1 < e_2 < \cdots < e_r$, relatively prime in pairs; let

$$m_1 = 2^{e_1} - 1, \quad m_2 = 2^{e_2} - 1, \quad \ldots, \quad m_r = 2^{e_r} - 1. \tag{24}$$

We are also given numbers $w_1$, ..., $w_r$ such that $0 \le w_j < m_j$. Our job is to
determine the binary representation of the number w that satisfies the conditions

$$0 \le w < m_1 m_2 \ldots m_r, \qquad w \equiv w_1 \ (\text{modulo } m_1), \quad \ldots, \quad w \equiv w_r \ (\text{modulo } m_r). \tag{25}$$

The method is based on (23) and (24) of Section 4.3.2. First we compute

$$w'_j = (\ldots((w_j - w'_1)\,c_{1j} - w'_2)\,c_{2j} - \cdots - w'_{j-1})\,c_{(j-1)j} \bmod m_j, \tag{26}$$

for j = 2, ..., r, where $w'_1 = w_1 \bmod m_1$; then we compute

$$w = (\ldots(w'_r m_{r-1} + w'_{r-1})\,m_{r-2} + \cdots + w'_2)\,m_1 + w'_1. \tag{27}$$

Here $c_{ij}$ is a number such that $c_{ij} m_i \equiv 1$ (modulo $m_j$); these numbers $c_{ij}$ are
not given, they must be determined from the $e_j$’s.

The calculation of (26) for all j involves $\binom{r}{2}$ additions modulo $m_j$, each
of which takes $O(e_r)$ cycles, plus $\binom{r}{2}$ multiplications by $c_{ij}$, modulo $m_j$. The
calculation of w by formula (27) involves r additions and r multiplications by $m_j$;
it is easy to multiply by $m_j$, since this is just adding, shifting, and subtracting,
so it is clear that the evaluation of Eq. (27) takes $O(r^2 e_r)$ cycles. We will soon
see that each of the multiplications by $c_{ij}$, modulo $m_j$, requires only $O(e_r \log e_r)$
cycles, and therefore it is possible to complete the entire job of conversion in
$O(r^2 e_r \log e_r)$ cycles.
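
Formulas (26) and (27) translate almost line for line into Python. In this sketch the constants $c_{ij}$ are obtained with the built-in `pow(m_i, -1, m_j)` (Python 3.8 and later) rather than by the fast special-purpose method developed in the text, and the function name `mixed_radix_reconstruct` is hypothetical.

```python
def mixed_radix_reconstruct(exponents: list, residues: list) -> int:
    """Given pairwise coprime e_1 < ... < e_r and w_j = w mod (2^{e_j} - 1), return w."""
    m = [(1 << e) - 1 for e in exponents]
    r = len(m)
    w_prime = [residues[0] % m[0]]                 # w'_1
    for j in range(1, r):                          # formula (26)
        acc = residues[j] % m[j]
        for i in range(j):
            c_ij = pow(m[i], -1, m[j])             # c_ij * m_i = 1 (modulo m_j)
            acc = ((acc - w_prime[i]) * c_ij) % m[j]
        w_prime.append(acc)
    w = w_prime[-1]                                # formula (27)
    for i in range(r - 2, -1, -1):
        w = w * m[i] + w_prime[i]
    return w
```

The inner loop performs one subtraction and one multiplication by $c_{ij}$ for each pair i < j, which is the $\binom{r}{2}$ count used in the analysis.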

The above observations leave us with the following problem to solve: Given
positive integers e < f and a nonnegative integer $u < 2^f$, compute the value of
$(cu) \bmod (2^f - 1)$, where c is the number such that $(2^e - 1)c \equiv 1$ (modulo $2^f - 1$);
and the computation must be done in $O(f \log f)$ cycles. The result of exercise
4.3.2-6 gives a formula for c that suggests a suitable procedure. First we find
the least positive integer b such that

$$be \equiv 1 \ (\text{modulo } f). \tag{28}$$

This can be done using Euclid’s algorithm in $O((\log f)^3)$ cycles, since Euclid’s
algorithm applied to e and f requires $O(\log f)$ iterations, and each iteration
requires $O((\log f)^2)$ cycles; alternatively, we could be very sloppy here without
violating the total time constraint, by simply trying b = 1, 2, etc., until (28)
is satisfied, since such a process would take $O(f \log f)$ cycles in all. Once b has
been found, exercise 4.3.2-6 tells us that

$$c = c[b] = \Bigl(\sum_{0 \le j < b} 2^{je}\Bigr) \bmod (2^f - 1). \tag{29}$$

A brute-force multiplication of $(cu) \bmod (2^f - 1)$ would not be good enough to
solve the problem, since we do not know how to multiply general f-bit numbers
in $O(f \log f)$ cycles. But the special form of c provides a clue: The binary
representation of c is composed of bits in a regular pattern, and Eq. (29) shows
that the number c[2b] can be obtained in a simple way from c[b]. This suggests
that we can rapidly multiply a number u by c[b] if we build c[b]u up in lg b steps
in a suitably clever manner, such as the following: Let the binary notation for b
be

$$b = (b_s \ldots b_2 b_1 b_0)_2;$$

we may calculate the sequences $a_k$, $d_k$, $u_k$, $v_k$ defined by the rules

$$\begin{aligned}
a_0 &= e, & a_k &= 2a_{k-1} \bmod f;\\
d_0 &= b_0 e, & d_k &= (d_{k-1} + b_k a_k) \bmod f;\\
u_0 &= u, & u_k &= (u_{k-1} + 2^{a_{k-1}} u_{k-1}) \bmod (2^f - 1);\\
v_0 &= b_0 u, & v_k &= (v_{k-1} + b_k 2^{d_{k-1}} u_k) \bmod (2^f - 1).
\end{aligned} \tag{30}$$

It is easy to prove by induction on k that

$$\begin{aligned}
a_k &= (2^k e) \bmod f; & u_k &= (c[2^k]\,u) \bmod (2^f - 1);\\
d_k &= ((b_k \ldots b_1 b_0)_2\,e) \bmod f; & v_k &= (c[(b_k \ldots b_1 b_0)_2]\,u) \bmod (2^f - 1).
\end{aligned} \tag{31}$$

Hence the desired result, $(c[b]u) \bmod (2^f - 1)$, is $v_s$. The calculation of $a_k$, $d_k$, $u_k$,
$v_k$ from $a_{k-1}$, $d_{k-1}$, $u_{k-1}$, $v_{k-1}$ takes $O(\log f) + O(\log f) + O(f) + O(f) = O(f)$
cycles, and therefore the entire calculation can be done in $s\,O(f) = O(f \log f)$
cycles as desired.
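
The four sequences can be followed step by step in a short Python sketch. Two simplifications are assumptions of this illustration: b is found with the built-in `pow(e, -1, f)` (Python 3.8 and later) instead of Euclid's algorithm, and multiplication by a power of 2 modulo $2^f - 1$ is written as a shift followed by `%`, where a binary machine would use a cyclic shift. The function name is hypothetical.

```python
def times_inverse_of_mersenne(u: int, e: int, f: int) -> int:
    """Return (c*u) mod (2^f - 1), where (2^e - 1)*c = 1 (modulo 2^f - 1).

    Requires 0 < e < f with gcd(e, f) = 1, and 0 <= u < 2^f.
    """
    M = (1 << f) - 1
    b = pow(e, -1, f)                    # least positive b with b*e = 1 (modulo f)
    bits = [(b >> i) & 1 for i in range(b.bit_length())]   # b_0, b_1, ..., b_s
    a = e % f                            # a_0
    d = (bits[0] * e) % f                # d_0
    uk = u % M                           # u_0
    vk = (bits[0] * u) % M               # v_0
    for k in range(1, len(bits)):
        a_prev, d_prev = a, d
        a = (2 * a_prev) % f
        d = (d_prev + bits[k] * a) % f
        uk = (uk + (uk << a_prev)) % M               # u_k uses a_{k-1}
        vk = (vk + bits[k] * (uk << d_prev)) % M     # v_k uses d_{k-1} and the new u_k
    return vk
```

For instance, with e = 3 and f = 5 we have b = 2 and c = c[2] = 1 + 2^3 = 9, and indeed 7 · 9 = 63 = 2 · 31 + 1.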

The reader will find it instructive to study the ingenious method represented
by (30) and (31) very carefully. Similar techniques are discussed in Section 4.6.3.

Schönhage’s paper [Computing 1 (1966), 182-196] shows that these ideas can
be extended to the multiplication of n-bit numbers using $r \approx 2^{\sqrt{2\lg n}}$ moduli,
obtaining a method analogous to Algorithm C. We shall not dwell on the details
here, since Algorithm C is always superior; in fact, an even better method is next
on our agenda.

C. Use of discrete Fourier transforms. The critical problem in high-precision
multiplication is the determination of “convolution products” such as

$$u_r v_0 + u_{r-1} v_1 + \cdots + u_0 v_r,$$

and there is an intimate relation between convolutions and an important mathematical
concept called “Fourier transformation.” If $\omega = \exp(2\pi i/K)$ is a Kth
root of unity, the one-dimensional Fourier transform of the sequence of complex
numbers $(u_0, u_1, \ldots, u_{K-1})$ is defined to be the sequence $(\hat u_0, \hat u_1, \ldots, \hat u_{K-1})$,
where

$$\hat u_s = \sum_{0 \le t < K} \omega^{st} u_t, \qquad 0 \le s < K. \tag{32}$$

Letting $(\hat v_0, \hat v_1, \ldots, \hat v_{K-1})$ be defined in the same way, as the Fourier transform
of $(v_0, v_1, \ldots, v_{K-1})$, it is not difficult to see that $(\hat u_0 \hat v_0, \hat u_1 \hat v_1, \ldots, \hat u_{K-1} \hat v_{K-1})$
is the transform of $(w_0, w_1, \ldots, w_{K-1})$, where

$$w_r = u_r v_0 + u_{r-1} v_1 + \cdots + u_0 v_r + u_{K-1} v_{r+1} + \cdots + u_{r+1} v_{K-1} = \sum_{i+j \equiv r \ (\text{modulo } K)} u_i v_j.$$

When $K \ge 2n - 1$ and $u_n = u_{n+1} = \cdots = u_{K-1} = v_n = v_{n+1} = \cdots = v_{K-1} = 0$,
the w’s are just what we need for multiplication, since the terms
$u_{K-1} v_{r+1} + \cdots + u_{r+1} v_{K-1}$ vanish when $0 \le r \le 2n - 2$. In other words, the
transform of a convolution product is the ordinary product of the transforms.
This idea is actually a special case of Toom’s use of polynomials (cf. (10)), with
x replaced by roots of unity.
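
The convolution property is easy to check numerically. The following Python sketch evaluates definition (32) directly, in $K^2$ steps and with floating-point complex numbers, so it illustrates the identity rather than a fast method; the names `dft` and `cyclic_convolution` and the sample sequences are chosen only for this example.

```python
import cmath

def dft(seq: list) -> list:
    """Transform (32): result[s] = sum over t of omega**(s*t) * seq[t]."""
    K = len(seq)
    return [sum(cmath.exp(2j * cmath.pi * ((s * t) % K) / K) * seq[t]
                for t in range(K))
            for s in range(K)]

def cyclic_convolution(u: list, v: list) -> list:
    """w[r] = sum of u[i]*v[j] over all i + j = r (modulo K)."""
    K = len(u)
    return [sum(u[i] * v[(r - i) % K] for i in range(K)) for r in range(K)]

# n = 3 nonzero terms each, padded with zeros to K = 8 >= 2n - 1.
u = [3, 1, 4, 0, 0, 0, 0, 0]
v = [2, 7, 1, 0, 0, 0, 0, 0]
w = cyclic_convolution(u, v)
lhs = dft(w)
rhs = [a * b for a, b in zip(dft(u), dft(v))]
max_gap = max(abs(a - b) for a, b in zip(lhs, rhs))
print(w, max_gap)
```

Because of the zero padding, w holds the coefficients of the ordinary polynomial product, and `max_gap` is at the level of rounding error.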

If K is a power of 2, the discrete Fourier transform (32) can be obtained quite
rapidly when the computations are arranged in a certain way, and so can the
inverse transform (determining the w’s from the $\hat w$’s). This property of Fourier
transforms was exploited by V. Strassen in 1968, who discovered how to multiply
large numbers faster than was possible under all previously known schemes. He
and A. Schönhage later refined the method and published improved procedures
in Computing 7 (1971), 281-292. In order to understand their approach to the
problem, let us first take a look at the mechanism of fast Fourier transforms.

Given a sequence of $K = 2^k$ complex numbers $(u_0, \ldots, u_{K-1})$, and given
the complex number

$$\omega = \exp(2\pi i/K),$$

the sequence $(\hat u_0, \ldots, \hat u_{K-1})$ defined in (32) can be calculated rapidly by carrying
out the following scheme. (In these formulas the parameters $s_j$ and $t_j$ are either
0 or 1, so that each “pass” represents $2^k$ computations.)

Pass 0. Let $A^{[0]}(t_{k-1}, \ldots, t_0) = u_t$, where $t = (t_{k-1} \ldots t_0)_2$.

Pass 1. Set $A^{[1]}(s_{k-1}, t_{k-2}, \ldots, t_0) \leftarrow$
$A^{[0]}(0, t_{k-2}, \ldots, t_0) + \omega^{(s_{k-1} 0 \ldots 0)_2} \cdot A^{[0]}(1, t_{k-2}, \ldots, t_0)$.

Pass 2. Set $A^{[2]}(s_{k-1}, s_{k-2}, t_{k-3}, \ldots, t_0) \leftarrow$
$A^{[1]}(s_{k-1}, 0, t_{k-3}, \ldots, t_0) + \omega^{(s_{k-2} s_{k-1} 0 \ldots 0)_2} \cdot A^{[1]}(s_{k-1}, 1, t_{k-3}, \ldots, t_0)$.

...

Pass k. Set $A^{[k]}(s_{k-1}, \ldots, s_1, s_0) \leftarrow$
$A^{[k-1]}(s_{k-1}, \ldots, s_1, 0) + \omega^{(s_0 s_1 \ldots s_{k-1})_2} \cdot A^{[k-1]}(s_{k-1}, \ldots, s_1, 1)$.

It is fairly easy to prove by induction that we have

$$A^{[j]}(s_{k-1}, \ldots, s_{k-j}, t_{k-j-1}, \ldots, t_0) = \sum_{0 \le t_{k-1}, \ldots, t_{k-j} \le 1} \omega^{(s_0 s_1 \ldots s_{k-1})_2 \cdot (t_{k-1} \ldots t_{k-j} 0 \ldots 0)_2}\,u_t, \tag{33}$$

so that

$$A^{[k]}(s_{k-1}, \ldots, s_1, s_0) = \hat u_s, \qquad \text{where } s = (s_0 s_1 \ldots s_{k-1})_2. \tag{34}$$

(Note the reversed order of the binary digits in s. For further discussion of
transforms such as this, see Section 4.6.4.)
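
The pass structure can be made concrete with a Python sketch in which the array index plays the role of the bit string $(x_{k-1} \ldots x_0)_2$. The names `reverse_bits`, `fft_passes`, and `dft_direct` are hypothetical; floating-point complex arithmetic is assumed, and a fresh array is allocated in each pass for clarity rather than speed.

```python
import cmath

def reverse_bits(x: int, width: int) -> int:
    """Reverse the low `width` bits of x."""
    y = 0
    for _ in range(width):
        y = (y << 1) | (x & 1)
        x >>= 1
    return y

def fft_passes(u: list) -> list:
    """Compute the transform (32) of a sequence of length K = 2**k in k passes."""
    K = len(u)
    k = K.bit_length() - 1
    if K < 1 or (1 << k) != K:
        raise ValueError("length must be a power of 2")
    omega = [cmath.exp(2j * cmath.pi * j / K) for j in range(K)]
    A = list(u)                                   # pass 0
    for j in range(1, k + 1):                     # pass j replaces t_{k-j} by s_{k-j}
        shift = k - j
        bit = 1 << shift
        B = [0j] * K
        for i in range(K):
            top = i >> shift                      # bits s_{k-1} ... s_{k-j}
            exponent = reverse_bits(top, j) << shift   # (s_{k-j} ... s_{k-1} 0 ... 0)_2
            B[i] = A[i & ~bit] + omega[exponent] * A[i | bit]
        A = B
    # By (34), A[(s_{k-1} ... s_0)_2] holds the transform value with index (s_0 ... s_{k-1})_2.
    result = [0j] * K
    for i in range(K):
        result[reverse_bits(i, k)] = A[i]
    return result

def dft_direct(u: list) -> list:
    """Direct evaluation of (32), used only for comparison."""
    K = len(u)
    return [sum(cmath.exp(2j * cmath.pi * ((s * t) % K) / K) * u[t]
                for t in range(K)) for s in range(K)]
```

Each of the k passes touches every one of the K entries once, giving the K lg K operation count on which the rest of the section depends.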

To get the inverse Fourier transform $(u_0, \ldots, u_{K-1})$ from the values of
$(\hat u_0, \ldots, \hat u_{K-1})$, we may note that the “double transform” is

$$\hat{\hat u}_r = \sum_{0 \le s < K} \omega^{rs} \hat u_s = \sum_{0 \le s,t < K} \omega^{rs} \omega^{st} u_t
= \sum_{0 \le t < K} u_t \Bigl(\sum_{0 \le s < K} \omega^{s(t+r)}\Bigr) = K u_{(-r) \bmod K},$$

since the geometric series $\sum_{0 \le s < K} \omega^{sj}$ sums to zero unless j is a multiple
of K. Therefore the inverse transform can be computed in the same way as
the transform itself, except that the final results must be divided by K and
shuffled slightly.
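
The “divide by K and shuffle” rule can be stated in two lines of Python. This sketch reuses a direct, quadratic-time evaluation of (32) for simplicity; the names `dft` and `inverse_dft` are hypothetical.

```python
import cmath

def dft(seq: list) -> list:
    """Direct evaluation of the transform (32)."""
    K = len(seq)
    return [sum(cmath.exp(2j * cmath.pi * ((s * t) % K) / K) * seq[t]
                for t in range(K)) for s in range(K)]

def inverse_dft(hat: list) -> list:
    """Recover u from its transform: u[t] = (double transform)[(-t) mod K] / K."""
    K = len(hat)
    double = dft(hat)
    return [double[(-t) % K] / K for t in range(K)]
```

The index map t to (-t) mod K leaves entry 0 in place and reverses the order of the remaining entries.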

Applying this to the problem of integer multiplication, suppose we wish to
compute the product of two n-bit integers u and v. As in Algorithm C we shall
work with groups of bits; let

$$2n \le 2^k l < 4n, \qquad K = 2^k, \qquad L = 2^l, \tag{35}$$

and write

$$u = (U_{K-1} \ldots U_1 U_0)_L, \qquad v = (V_{K-1} \ldots V_1 V_0)_L, \tag{36}$$

regarding u and v as K-place numbers in radix L so that each digit $U_j$ or $V_j$ is
an l-bit integer. Actually the leading digits $U_j$ and $V_j$ are zero for all $j \ge K/2$,
because $2^{k-1} l \ge n$. We will select appropriate values for k and l later; at the
moment our goal is to see what happens in general, so that we can choose k
and l intelligently when all the facts are before us.

The next step of the multiplication procedure is to compute the Fourier
transforms $(\hat u_0, \ldots, \hat u_{K-1})$ and $(\hat v_0, \ldots, \hat v_{K-1})$ of the sequences $(u_0, \ldots, u_{K-1})$
and $(v_0, \ldots, v_{K-1})$, where we define

$$u_t = U_t / 2^{k+l}, \qquad v_t = V_t / 2^{k+l}. \tag{37}$$

This scaling is done for convenience so that the absolute values $|u_t|$ and $|v_t|$ are
less than $2^{-k}$, ensuring that $|\hat u_s|$ and $|\hat v_s|$ will be less than 1 for all s.

An obvious problem arises here, since the complex number $\omega$ can’t be represented
exactly in binary notation. How are we going to compute a reliable
Fourier transform? By a stroke of good luck, it turns out that everything will
work properly if we do the calculations with only a modest amount of precision.
For the moment let us bypass this question and assume that infinite-precision
calculations are being performed; we shall analyze later how much accuracy is
actually needed.

Once the $\hat u_s$ and $\hat v_s$ have been found, we let $\hat w_s = \hat u_s \hat v_s$ for $0 \le s < K$ and
determine the inverse Fourier transform $(w_0, \ldots, w_{K-1})$. As explained above,
we now have

$$w_r = \sum_{i+j=r} u_i v_j = \sum_{i+j=r} U_i V_j / 2^{2k+2l},$$

so the integers $W_r = 2^{2k+2l} w_r$ are the coefficients in the desired product

$$u \cdot v = W_{K-2} L^{K-2} + \cdots + W_1 L + W_0. \tag{38}$$

Since $0 \le W_r < (r+1)L^2 < KL^2$, each $W_r$ has at most k + 2l bits, so it will
not be difficult to compute the binary representation when the W’s are known
unless k is large compared to l.
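
The overall procedure can be tried in Python with ordinary double-precision floating point. Several choices in this sketch are assumptions made for illustration and differ from the text: the digit size l is fixed at 8 bits instead of being chosen as a function of n, the scaling (37) is omitted and the result is rounded to the nearest integer instead, a standard recursive radix-2 transform is used, and no claim is made about accuracy for very large n. The function names are hypothetical.

```python
import cmath

def fft(a: list, invert: bool = False) -> list:
    """Recursive radix-2 transform; with invert=True the conjugate root is used."""
    K = len(a)
    if K == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = -1 if invert else 1
    out = [0j] * K
    for s in range(K // 2):
        w = cmath.exp(sign * 2j * cmath.pi * s / K)
        out[s] = even[s] + w * odd[s]
        out[s + K // 2] = even[s] - w * odd[s]
    return out

def fft_multiply(u: int, v: int, l: int = 8) -> int:
    """Multiply nonnegative integers via radix-2**l digits and floating-point FFTs."""
    n = max(u.bit_length(), v.bit_length(), 1)
    K = 1
    while K * l < 2 * n:          # ensures the upper half of the digits is zero
        K *= 2
    L = 1 << l
    U = [(u >> (l * t)) & (L - 1) for t in range(K)]
    V = [(v >> (l * t)) & (L - 1) for t in range(K)]
    w_hat = [a * b for a, b in zip(fft(U), fft(V))]
    w = fft(w_hat, invert=True)
    W = [round(x.real / K) for x in w]                  # integer coefficients W_r
    return sum(W_r << (l * r) for r, W_r in enumerate(W))   # formula (38)
```

With l = 8 the coefficients $W_r$ stay far below $2^{53}$ for operands of a few thousand bits, so rounding recovers them exactly in this range.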

Let us try to estimate how much time this method takes, if m-bit fixed
point arithmetic is used in calculating the Fourier transforms. Exercise 10 shows
that all of the quantities computed during all the passes of the transform calculations
will be less than 1 in magnitude because of the scaling (37), hence it suffices to
deal with m-bit fractions $(.a_{-1} \ldots a_{-m})_2$ for the real and imaginary parts of all
the intermediate quantities. Simplifications are possible because the inputs $u_t$
and $v_t$ are real-valued; only K real values instead of 2K need to be carried in
each step (see exercise 4.6.4-14). We will ignore such refinements in order to
keep complications to a minimum.

The first job is to compute $\omega$ and its powers. For simplicity we shall make
a table of the values $\omega^0$, ..., $\omega^{K-1}$. Let

$$\omega_r = \exp(2\pi i/2^r),$$

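so that $\omega_1 = -1$, $\omega_2 = i$, $\omega_3 = (1+i)/\sqrt2$, ..., $\omega_k = \omega$. If $\omega_r = x_r + iy_r$, we
have

$$\omega_{r+1} = \sqrt{\frac{1 + x_r}{2}} + i\,\sqrt{\frac{1 - x_r}{2}}. \tag{39}$$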







The calculation of $\omega_1$, $\omega_2$, ..., $\omega_k$ takes negligible time compared with the other
computations we need, so we can use any straightforward algorithm for square
roots. Once the $\omega_r$ have been calculated we can compute all of the powers $\omega^j$
by starting with $\omega^0 = 1$ and using the following idea for j > 0: If $j = 2^{k-r} \cdot q$
where q is odd, and if $j_0 = 2^{k-r} \cdot (q - 1)$, we have

$$\omega^j = \omega^{j_0} \cdot \omega_r. \tag{40}$$

This method of calculation keeps errors from propagating, since each $\omega^j$ is a
product of at most k of the $\omega_r$’s. The total time to calculate all the $\omega^j$ is
O(KM), where M is the time to do an m-bit complex multiplication; this is less
time than the subsequent steps will require, so we can ignore it.
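
The table of powers can be built exactly as described, one multiplication per entry. In the Python sketch below, floating-point numbers stand in for m-bit fixed point fractions, the values $\omega_r$ are obtained from the standard half-angle identities for cosine and sine (an assumption of this sketch), and the function name `omega_table` is hypothetical.

```python
import math

def omega_table(k: int) -> list:
    """Return [omega**0, ..., omega**(K-1)] for K = 2**k, k >= 1, using rule (40)."""
    K = 1 << k
    # omega_r = exp(2*pi*i / 2**r) for r = 1, ..., k, by repeated half-angle steps.
    omega_r = [None, complex(-1.0, 0.0)]
    x = -1.0
    for _ in range(1, k):
        x, y = math.sqrt((1 + x) / 2), math.sqrt((1 - x) / 2)
        omega_r.append(complex(x, y))
    powers = [complex(1.0, 0.0)] * K
    for j in range(1, K):
        low = j & -j                       # 2**(k-r), the lowest set bit of j
        r = k - (low.bit_length() - 1)
        powers[j] = powers[j - low] * omega_r[r]   # omega**j = omega**j0 * omega_r
    return powers
```

Since j - low has one fewer 1 bit than j, each entry is a product of at most k of the $\omega_r$, which is the property that limits error growth.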

Each of the three Fourier transformations comprises k passes, each of which
involves K operations of the form $a \leftarrow b + \omega^j c$, so the total time to calculate
the Fourier transforms is

$$O(kKM) = O(Mnk/l).$$

Finally, the work involved in computing the binary digits of u · v using (38) is
$O(K(k + l)) = O(n + nk/l)$. Summing over all operations, we find that the
total time to multiply n-bit numbers u and v will be $O(n) + O(Mnk/l)$.

Now let’s see how large the intermediate precision m needs to be, so that 
we know how large M needs to be. For simplicity we shall be content with 
safe estimates of the accuracy, instead of finding the best possible bounds. It 
will be convenient to compute all the $\omega^j$ so that our approximations $(\omega^j)'$ will
satisfy $|(\omega^j)'| \le 1$; this condition is easy to guarantee if we truncate towards
zero instead of rounding. The operations we need to perform with $m$-bit fixed
point complex arithmetic are all obtained by replacing an exact computation of
the form $a \leftarrow b + \omega^j c$ by the approximate computation

$$a' \leftarrow \operatorname{truncate}\bigl(b' + (\omega^j)'c'\bigr), \tag{41}$$

where $b'$, $(\omega^j)'$, and $c'$ are previously computed approximations; all of these
complex numbers and their approximations are bounded by 1 in absolute value.
If $|b' - b| \le \delta_1$, $|(\omega^j)' - \omega^j| \le \delta_2$, and $|c' - c| \le \delta_3$, it is not difficult to see that
we will have $|a' - a| < \delta + \delta_1 + \delta_2 + \delta_3$, where

$$\delta = |2^{-m} + 2^{-m}i| = 2^{1/2-m},$$

because we have $|(\omega^j)'c' - \omega^j c| = |((\omega^j)' - \omega^j)c' + \omega^j(c' - c)| \le \delta_2 + \delta_3$, and
$\delta$ is the maximum truncation error. The approximations $(\omega^j)'$ are obtained by
starting with approximate values $\omega'_r$ to the numbers defined in (39), and we may
assume that $|\omega'_r - \omega_r| \le \delta$. Each multiplication (40) has the form of (41) with
$b' = 0$, so an additional error of at most $2\delta$ is made per multiplication, and we
have $|(\omega^j)' - \omega^j| \le (2k - 1)\delta$ for all $j$.

If we have errors of at most $\epsilon$ before any pass of the fast Fourier transform,
the operations of that pass therefore have the form (41) where $\delta_1 = \delta_3 = \epsilon$ and
$\delta_2 = (2k - 1)\delta$, and the errors after the pass will be at most $2\epsilon + 2k\delta$. There is
no error in “Pass 0,” so we find by induction on $j$ that the maximum error after
“Pass $j$” is bounded by $(2^j - 1) \cdot 2k\delta$, and the computed values of $\hat u_s$ will satisfy
$|(\hat u_s)' - \hat u_s| < (2^k - 1) \cdot 2k\delta$. A similar formula will hold for $(\hat v_s)'$; and we will
have

$$|(\hat w_s)' - \hat w_s| < 2(2^k - 1) \cdot 2k\delta + \delta.$$

During the inverse transformation there is an additional accumulation of errors, 
but the division by K = 2 k ameliorates most of this; by the same argument we 
find that the computed values $w'_r$ will satisfy

$$|w'_r - w_r| < 4k2^k\delta.$$

We need enough precision to make $2^{2k+2l}w'_r$ round to the correct integer $W_r$,
hence we need

$$2^{2k+2l+2+\lg k+k+1/2-m} \le \tfrac{1}{2},$$

i.e., $m \ge 3k + 2l + \lg k + 7/2$. This will hold if we simply require that

$$k \ge 7 \quad\text{and}\quad m \ge 4k + 2l. \tag{42}$$

Relations (35) and (42) can be used to determine parameters $k$, $l$, $m$ so that
multiplication takes $O(n) + O(Mnk/l)$ units of time, where $M$ is the time to
multiply $m$-bit fractions.

If we are using MIX, for example, suppose we want to multiply binary numbers
having $n = 2^{13} = 8192$ bits each. We can choose $k = 11$, $l = 8$, $m = 60$,
so that the necessary $m$-bit operations are nothing more than double precision
arithmetic. The running time $M$ needed to do fixed point $m$-bit complex
multiplication will therefore be comparatively small. With triple-precision operations
we can go up for example to $k = l = 15$, $n \le 15 \cdot 2^{14}$, which takes us way
beyond MIX's memory capacity.

Further study of the choice of k, l, and m leads in fact to a rather surprising 
conclusion: For all practical purposes we can assume that $M$ is constant, and the
Schönhage-Strassen multiplication technique will have a running time linearly
proportional to $n$. The reason is that we can choose $k = l$ and $m = 6k$; this
choice of $k$ is always less than $\lg n$, so we will never need to use more than
sextuple precision unless $n$ is larger than the word size of our computer. (In
particular, n would have to be larger than the capacity of an index register, so 
we probably couldn’t fit the numbers u and v in main memory.) 

The practical problem of fast multiplication is therefore solved, except for 
improvements in the constant factor. In fact, the all-integer convolution algorithm
of exercise 4.6.4-59 is probably a better choice for practical high-precision
multiplication, even though it has a slightly worse asymptotic behavior. Our
interest in multiplying large numbers is partly theoretical, however, because it
is interesting to explore the ultimate limits of computational complexity. So
let’s forget practical considerations and suppose that $n$ is extremely huge, perhaps
much larger than the number of atoms in the universe. We can let $m$ be
approximately $6 \lg n$, and use the same algorithm recursively to do the $m$-bit
multiplications. The running time will satisfy $T(n) = O(nT(\log n))$; hence

$$T(n) \le C\,n\,(C \lg n)(C \lg\lg n)(C \lg\lg\lg n) \ldots,$$

where the product continues until reaching a factor with $\lg \ldots \lg n < 1$.

Schönhage and Strassen showed how to improve this theoretical upper bound
to $O(n \log n \log\log n)$ in their paper, by using integer numbers $\omega$ to carry out fast
Fourier transforms on integers, modulo numbers of the form $2^e + 1$. This upper
bound applies to Turing machines, i.e., to computers with bounded memory and
a finite number of arbitrarily long tapes.

If we allow ourselves a more powerful computer, with random access to any 
number of words of bounded size, Schönhage has pointed out that the upper
bound drops to $O(n \log n)$. For we can choose $k = l$ and $m = 6k$, and we
have time to build a complete multiplication table of all possible products $xy$
for $0 \le x, y < 2^{\lceil k/2 \rceil}$. (The number of such products is $2^k$ or $2^{k+1}$, and we
can compute each table entry by addition from one of its predecessors in $O(k)$
steps, hence $O(k2^k) = O(n)$ steps will suffice for the calculation.) In this case
$M$ is the time needed to do 12-place arithmetic in radix $2^{\lceil k/2 \rceil}$, and it follows
that $M = O(k) = O(\log n)$ because 1-place multiplication can be done by table
lookup.

Schönhage discovered in 1979 that a pointer machine can carry out $n$-bit
multiplication in $O(n)$ steps; see exercise 12. Such devices (which are also called
“storage modification machines” and “linking automata”) seem to provide the
best models of computation when $n \to \infty$, as discussed at the end of Section 2.6.
So we can conclude that multiplication in $O(n)$ steps is possible for theoretical
purposes as well as in practice.

D. Division. Now that we have efficient routines for multiplication, let’s consider 
the inverse problem. It turns out that division can be performed just as fast as 
multiplication, except for a constant factor. 

To divide an $n$-bit number $u$ by an $n$-bit number $v$, we may first find an
$n$-bit approximation to $1/v$, then multiply by $u$ to get an approximation $q$ to
$u/v$; finally, we can make the slight correction necessary to $q$ to ensure that
$0 \le u - qv < v$ by using another multiplication. From this reasoning, we see
that it suffices to have an efficient algorithm for approximating the reciprocal of 
an n-bit number. The following algorithm does this, using “Newton’s method” 
as explained at the end of Section 4.3.1. 

Algorithm R (High-precision reciprocal). Let $v$ have the binary representation
$v = (0.v_1v_2v_3 \ldots)_2$, where $v_1 = 1$. This algorithm computes an approximation
$z$ to $1/v$, such that

$$|z - 1/v| \le 2^{-n}. \tag{43}$$

R1. [Initial approximation.] Set $z \leftarrow \frac{1}{4}\lfloor 32/(4v_1 + 2v_2 + v_3) \rfloor$ and $k \leftarrow 0$.

R2. [Newtonian iteration.] (At this point we have a number $z$ of the binary
form $(xx.xx \ldots x)_2$ with $2^k + 1$ places after the radix point, and $z \le 2$.)
Calculate $z^2 = (xxx.xx \ldots x)_2$ exactly, using a high-speed multiplication
routine. Then calculate $V_k z^2$ exactly, where $V_k = (0.v_1v_2 \ldots v_{2^{k+1}+3})_2$.
Then set $z \leftarrow 2z - V_k z^2 + r$, where $0 \le r < 2^{-2^{k+1}-1}$ is added if necessary
to “round up” $z$ so that it is a multiple of $2^{-2^{k+1}-1}$. Finally, set $k \leftarrow k + 1$.

R3. [Test for end.] If $2^k < n$, go back to step R2; otherwise the algorithm
terminates. ∎
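
Steps R1 through R3 can be traced with exact rational arithmetic. The following Python sketch is illustrative only: it uses `fractions.Fraction` in place of a high-speed multiplication routine, so it shows the bookkeeping of the iteration and not its speed; the function name `reciprocal` is hypothetical.

```python
from fractions import Fraction
from math import ceil, floor


def reciprocal(v, n):
    """Approximate 1/v for 1/2 <= v < 1, following steps R1-R3."""
    v = Fraction(v)
    assert Fraction(1, 2) <= v < 1
    # R1: floor(8v) is the integer (v1 v2 v3)_2 formed by the three leading bits.
    z = Fraction(32 // floor(8 * v), 4)
    k = 0
    while True:
        # R2: V_k keeps the leading 2**(k+1) + 3 bits of v.
        p = 2 ** (k + 1) + 3
        V_k = Fraction(floor(v * 2 ** p), 2 ** p)
        z = 2 * z - V_k * z * z
        # Round up to a multiple of 2**-(2**(k+1) + 1).
        scale = 2 ** (2 ** (k + 1) + 1)
        z = Fraction(ceil(z * scale), scale)
        k += 1
        # R3: stop once 2**k >= n.
        if 2 ** k >= n:
            return z


approx = reciprocal(Fraction(5, 7), 64)
error = abs(approx - Fraction(7, 5))
```

The number of correct bits roughly doubles on every pass through R2, which is why only about $\lg n$ iterations are needed.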

This algorithm is based on a suggestion by S. A. Cook. A similar technique 
has been used in computer hardware [see Anderson, Earle, Goldschmidt, and 
Powers, IBM J. Res. Dev. 11 (1967), 48-52]. Of course, it is necessary to check 
the accuracy of Algorithm R quite carefully, because it comes very close to being 
inaccurate. We will prove by induction that 

$$z \le 2 \quad\text{and}\quad |z - 1/v| \le 2^{-2^k} \tag{44}$$

at the beginning and end of step R2.

For this purpose, let $\delta_k = 1/v - z_k$, where $z_k$ is the value of $z$ after $k$
iterations of step R2. To start the induction on $k$, we have

$$\delta_0 = 1/v - 8/v' + (32/v' - \lfloor 32/v' \rfloor)/4 = \eta_1 + \eta_2,$$

where $v' = (v_1v_2v_3)_2$ and $\eta_1 = (v' - 8v)/vv'$, so that we have $-\frac{1}{2} < \eta_1 \le 0$
and $0 \le \eta_2 < \frac{1}{4}$. Hence $|\delta_0| < \frac{1}{2}$. Now suppose that (44) has been verified
for $k$; then

$$\begin{aligned}
\delta_{k+1} = 1/v - z_{k+1} &= 1/v - z_k - z_k(1 - z_kV_k) - r \\
&= \delta_k - z_k(1 - z_kv) - z_k^2(v - V_k) - r \\
&= \delta_k - (1/v - \delta_k)v\delta_k - z_k^2(v - V_k) - r \\
&= v\delta_k^2 - z_k^2(v - V_k) - r.
\end{aligned}$$

Now

$$0 \le v\delta_k^2 < \delta_k^2 \le (2^{-2^k})^2 = 2^{-2^{k+1}},$$

and

$$0 \le z^2(v - V_k) + r < 4(2^{-2^{k+1}-3}) + 2^{-2^{k+1}-1} = 2^{-2^{k+1}},$$

so $|\delta_{k+1}| \le 2^{-2^{k+1}}$. We must still verify the first inequality of (44); to show
that $z_{k+1} \le 2$, there are three cases: (a) $V_k = \frac{1}{2}$; then $z_{k+1} = 2$. (b) $V_k \ne \frac{1}{2} = V_{k-1}$;
then $z_k = 2$, so $2z_k - z_k^2V_k \le 2 - 2^{-2^{k+1}-1}$. (c) $V_{k-1} \ne \frac{1}{2}$; then
$z_{k+1} = 1/v - \delta_{k+1} < 2 - 2^{-2^{k+1}} < 2$, since $k > 0$.

The running time of Algorithm R is bounded by 


$$2T(4n) + 2T(2n) + 2T(n) + 2T(\tfrac{1}{2}n) + \cdots + O(n)$$

steps, where $T(n)$ is an upper bound on the time needed to do a multiplication of
$n$-bit numbers. If $T(n)$ has the form $nf(n)$ for some monotonically nondecreasing
function $f(n)$, we have

$$T(4n) + T(2n) + T(n) + \cdots < T(8n),$$

so division can be done with a speed comparable to that of multiplication except 
for a constant factor. 

R. P. Brent has shown that functions such as $\log x$, $\exp x$, and $\arctan x$ can
be evaluated to $n$ significant bits in $O(M(n) \log n)$ steps, if it takes $M(n)$ units
of time to multiply n-bit numbers [JACM 23 (1976), 242-251]. 

E. An even faster multiplication method. It is natural to wonder if multiplication 
of $n$-bit numbers can be accomplished in just $n$ steps. We have come from order
$n^2$ down to order $n$, so perhaps we can squeeze the time down to the absolute
minimum. In fact, it is actually possible to output the answer as fast as we input 
the digits, if we leave the domain of conventional computer programming and 
allow ourselves to build a computer that has an unlimited number of components 
all acting at once. 

A linear iterative array of automata is a set of devices $M_1, M_2, M_3, \ldots$
that can each be in a finite set of “states” at each step of a computation. The
machines $M_2, M_3, \ldots$ all have identical circuitry, and their state at time $t + 1$
is a function of their own state at time $t$ as well as the states of their left and
right neighbors at time $t$. The first machine $M_1$ is slightly different: its state at
time $t + 1$ is a function of its own state and that of $M_2$, at time $t$, and also of
the input at time $t$. The output of a linear iterative array is a function defined
on the states of $M_1$.

Let $u = (u_{n-1} \ldots u_1u_0)_2$, $v = (v_{n-1} \ldots v_1v_0)_2$, and $q = (q_{n-1} \ldots q_1q_0)_2$
be binary numbers, and let $uv + q = w = (w_{2n-1} \ldots w_1w_0)_2$. It is a remarkable
fact that a linear iterative array can be constructed, independent of $n$, that will
output $w_0, w_1, w_2, \ldots$ at times 1, 2, 3, ..., if it is given the inputs $(u_0, v_0, q_0)$,
$(u_1, v_1, q_1)$, $(u_2, v_2, q_2)$, ... at times 0, 1, 2, ... .

We can state this phenomenon in the language of computer hardware, by 
saying that it is possible to design a single “integrated circuit module” with 
the following property: If we wire together sufficiently many of these devices in a 
straight line, with each module communicating only with its left and right neighbors,
the resulting circuitry will produce the $2n$-bit product of $n$-bit numbers in
exactly $2n$ clock pulses.

Here is the basic idea behind this construction: At time 0, machine $M_1$ senses
$(u_0, v_0, q_0)$ and it therefore is able to output $(u_0v_0 + q_0) \bmod 2$ at time 1. Then
it sees $(u_1, v_1, q_1)$ and it can output $(u_0v_1 + u_1v_0 + q_1 + k_1) \bmod 2$, where $k_1$ is
the “carry” left over from the previous step, at time 2. Next it sees $(u_2, v_2, q_2)$
and outputs $(u_0v_2 + u_1v_1 + u_2v_0 + q_2 + k_2) \bmod 2$; furthermore, its state holds
the values of $u_2$ and $v_2$ so that machine $M_2$ will be able to sense these values at
time 3, and $M_2$ will be able to compute $u_2v_2$ for the benefit of $M_1$ at time 4.
Machine $M_1$ essentially arranges to start $M_2$ multiplying the sequence $(u_2, v_2)$,
$(u_3, v_3)$, ..., and $M_2$ will ultimately give $M_3$ the job of multiplying $(u_4, v_4)$,
$(u_5, v_5)$, etc. Fortunately, things just work out so that no time is lost. The reader
will find it interesting to deduce further details from the formal description that
follows.

Table 1

MULTIPLICATION IN A LINEAR ITERATIVE ARRAY

[Table body not reproduced. Columns: Time; Input; and, for each of the modules
$M_1$, $M_2$, $M_3$, the state components $c$, $x_0$, $y_0$, $x_1$, $y_1$, $x$, $y$, $z_2$, $z_1$, $z_0$.]

Each automaton has $2^{11}$ states

$$(c, x_0, y_0, x_1, y_1, x, y, z_2, z_1, z_0),$$

where $0 \le c < 4$ and each of the $x$'s, $y$'s, and $z$'s is either 0 or 1. Initially,
all devices are in state $(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)$. Suppose that a machine $M_j$, for
$j > 1$, is in state $(c, x_0, y_0, x_1, y_1, x, y, z_2, z_1, z_0)$ at time $t$, and its left neighbor
$M_{j-1}$ is in state $(c^l, x_0^l, y_0^l, x_1^l, y_1^l, x^l, y^l, z_2^l, z_1^l, z_0^l)$ while its right neighbor $M_{j+1}$
is in state $(c^r, x_0^r, y_0^r, x_1^r, y_1^r, x^r, y^r, z_2^r, z_1^r, z_0^r)$ at that time. Then machine $M_j$ will
go into state $(c', x_0', y_0', x_1', y_1', x', y', z_2', z_1', z_0')$ at time $t + 1$, where

$$\begin{aligned}
c' &= \begin{cases} \min(c + 1, 3) & \text{if } c^l = 3, \\ 0 & \text{otherwise;} \end{cases} \\
(x_0', y_0') &= \begin{cases} (x^l, y^l) & \text{if } c = 0, \\ (x_0, y_0) & \text{otherwise;} \end{cases} \\
(x_1', y_1') &= \begin{cases} (x^l, y^l) & \text{if } c = 1, \\ (x_1, y_1) & \text{otherwise;} \end{cases} \\
(x', y') &= \begin{cases} (x^l, y^l) & \text{if } c \ge 2, \\ (x, y) & \text{otherwise;} \end{cases}
\end{aligned} \tag{45}$$

and $(z_2' z_1' z_0')_2$ is the binary notation for

$$z_0^r + z_1 + z_2^l + \begin{cases}
x^l y^l, & \text{if } c = 0; \\
x_0 y^l + x^l y_0, & \text{if } c = 1; \\
x_0 y^l + x_1 y_1 + x^l y_0, & \text{if } c = 2; \\
x_0 y^l + x_1 y + x y_1 + x^l y_0, & \text{if } c = 3.
\end{cases} \tag{46}$$



The leftmost machine $M_1$ behaves in almost the same way as the others; it acts
exactly as if there were a machine to its left in state $(3, 0, 0, 0, 0, u, v, q, 0, 0)$ when
it is receiving the inputs $(u, v, q)$. The output of the array is the $z_0$ component
of $M_1$.

Table 1 shows an example of this array acting on the inputs

$$u = v = (\ldots 00010111)_2, \qquad q = (\ldots 00001011)_2.$$

The output sequence appears in the lower right portion of the states of $M_1$:

0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, ...,

representing the number $(\ldots 01000011100)_2$ from right to left.
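
The transition rules (45) and (46) are easy to simulate in software. The Python sketch below is a direct transcription under stated assumptions: the function name `iterative_array` is hypothetical, the array is given `steps + 2` modules (a generous count chosen for simplicity, not the minimum asked for in exercise 11), and the module past the right end is treated as permanently idle.

```python
def iterative_array(u, v, q, steps):
    """Return the z0 outputs of M1 at times 1, 2, ..., steps."""
    idle = (0,) * 10          # (c, x0, y0, x1, y1, x, y, z2, z1, z0)
    states = [idle] * (steps + 2)
    outputs = []
    for t in range(steps):
        ut, vt, qt = (u >> t) & 1, (v >> t) & 1, (q >> t) & 1
        new_states = []
        for j, (c, x0, y0, x1, y1, x, y, z2, z1, z0) in enumerate(states):
            # M1 sees a virtual left neighbor that carries the current inputs.
            left = (3, 0, 0, 0, 0, ut, vt, qt, 0, 0) if j == 0 else states[j - 1]
            right = states[j + 1] if j + 1 < len(states) else idle
            cl, xl, yl, z2l, z0r = left[0], left[5], left[6], left[7], right[9]
            # Rule (46): partial products chosen according to c.
            if c == 0:
                s = xl * yl
            elif c == 1:
                s = x0 * yl + xl * y0
            elif c == 2:
                s = x0 * yl + x1 * y1 + xl * y0
            else:
                s = x0 * yl + x1 * y + x * y1 + xl * y0
            total = z0r + z1 + z2l + s      # at most 7, so it fits in three bits
            # Rule (45): update the counter and latch the left neighbor's (x, y).
            new_c = min(c + 1, 3) if cl == 3 else 0
            if c == 0:
                x0, y0 = xl, yl
            elif c == 1:
                x1, y1 = xl, yl
            else:
                x, y = xl, yl
            new_states.append((new_c, x0, y0, x1, y1, x, y,
                               total >> 2, (total >> 1) & 1, total & 1))
        states = new_states
        outputs.append(states[0][9])
    return outputs


example = iterative_array(0b10111, 0b10111, 0b1011, 11)
```

Here `example` reproduces the output sequence quoted above, the low-order bits of $23 \cdot 23 + 11 = 540$.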

This construction is based on a similar one first published by A. J. Atrubin, 
IEEE Trans. EC-14 (1965), 394-399. 

S. Winograd [JACM 14 (1967), 793-802] has investigated the minimum 
multiplication time achievable in a logical circuit when n is given and when the 
inputs are available all at once in coded form. See also C. S. Wallace, IEEE
Trans. EC-13 (1964), 14-17; A. C. Yao, to appear.

EXERCISES


1. [22] The idea expressed in (2) can be generalized to the decimal system, if the 
radix 2 is replaced by 10. Using this generalization, calculate 2718 times 4742 (reducing 
this product of four-digit numbers to three products of two-digit numbers, and reducing 
each of the latter to products of one-digit numbers). 

2. [M22] Prove that, in step C1 of Algorithm C, the value of $R$ either stays the same
or increases by one when we set $R \leftarrow \lfloor \sqrt{Q} \rfloor$. (Therefore, as observed in that step, we
need not calculate a square root.)

3. [M23] Prove that the sequences $q_k$, $r_k$ defined in Algorithm C satisfy the inequality
$2^{q_k+1}(2r_k)^{r_k} \le 2^{q_{k-1}+q_k}$, when $k > 0$.

► 4. [28] (K. Baker.) Show that it is advantageous to evaluate the polynomial $W(x)$
at the points $x = -r, \ldots, 0, \ldots, r$ instead of at the nonnegative points $x = 0, 1,
\ldots, 2r$ as in Algorithm C. The polynomial $U(x)$ can be written

$$U(x) = U_e(x^2) + xU_o(x^2),$$

and similarly $V(x)$ and $W(x)$ can be expanded in this way; show how to exploit this
idea, obtaining faster calculations in steps C7 and C8.

► 5. [35] Show that if in step C1 of Algorithm C we set $R \leftarrow \lceil \sqrt{2Q}\, \rceil + 1$ instead of
$R \leftarrow \lfloor \sqrt{Q} \rfloor$, with suitable initial values of $q_0$, $q_1$, $r_0$, and $r_1$, then (19) can be improved
to $t_k \le q_{k+1} 2^{\sqrt{2 \lg q_{k+1}}} (\lg q_{k+1})$.


6. [M23] Prove that the six numbers in (22) are relatively prime in pairs. 

7. [M23] Prove (23). 


► 8. [25] Prove that it takes only $O(K \log K)$ arithmetic operations to evaluate the
discrete Fourier transform (32), even when $K$ is not a power of 2. [Hint: Rewrite (32)
in the form

$$\hat u_s = \omega^{-s^2/2} \sum_{0 \le t < K} \omega^{(s+t)^2/2}\, \omega^{-t^2/2}\, u_t$$

and express this sum as a convolution product.]

9. [M15] Suppose the Fourier transformation method of the text is applied with all
occurrences of $\omega$ replaced by $\omega^q$, where $q$ is some fixed integer. Find a simple relation
between the numbers $(\tilde u_0, \tilde u_1, \ldots, \tilde u_{K-1})$ obtained by this general procedure and the
numbers $(\hat u_0, \hat u_1, \ldots, \hat u_{K-1})$ obtained when $q = 1$.

10. [M26] The scaling in (37) makes it clear that all the complex numbers
computed by pass $j$ of the transformation subroutine will be less than $2^{j-k}$ in absolute
value, during the calculations of $\hat u_s$ and $\hat v_s$ in the Schönhage-Strassen multiplication
algorithm. Show that all of these numbers will be less than 1 in absolute value during the
third Fourier transformation (the calculation of $w_r$).

► 11. [M26] If $n$ is fixed, how many of the automata in the linear iterative array (45),
(46) are needed to compute the product of $n$-bit numbers? (Note that the automaton
$M_j$ is influenced only by the component $z_0^r$ of the machine on its right, so we may
remove all automata whose $z_0$ component is always zero whenever the inputs are $n$-bit
numbers.)

► 12. [50] (A. Schönhage.) The purpose of this exercise is to prove that a simple form
of pointer machine can multiply $n$-bit numbers in $O(n)$ steps. The machine has no
built-in facilities for arithmetic; all it does is work with nodes and pointers. Each node
has the same finite number of link fields, and there are finitely many link registers. The
only operations this machine can do are:

i) read one bit of input and jump if that bit is 0;

ii) output 0 or 1;

iii) load a register with the contents of another register or with the contents of a
link field in a node pointed to by a register;

iv) store the contents of a register into a link field in a node pointed to by a register;

v) jump if two registers are equal;

vi) create a new node and make a register point to it;

vii) halt.

Implement the Fourier-transform multiplication method efficiently on such a machine.
[Hints: First show that if $N$ is any positive integer, it is possible to create $N$ nodes
representing the integers $\{0, 1, \ldots, N - 1\}$, where the node representing $p$ has pointers
to the nodes representing $p + 1$, $\lfloor p/2 \rfloor$, and $2p$. These nodes can be created in $O(N)$
steps. Show that arithmetic with radix $N$ can now be simulated without difficulty: for
example, it takes $O(\log N)$ steps to find the node for $(p + q) \bmod N$ and to determine if
$p + q \ge N$, given pointers to $p$ and $q$; and multiplication can be simulated in $O(\log N)^2$
steps. Now consider the algorithm in the text, with $k = l$ and $m = 6k$ and $N =
2^{\lceil m/13 \rceil}$, so that all quantities in the fixed point arithmetic calculations are 13-place
integers with radix $N$. Finally, show that each pass of the fast Fourier transformations
can be done in $O(K + (N \log N)^2) = O(K)$ steps, using the following idea: Each of
the $K$ necessary assignments can be “compiled” into a bounded list of instructions
for a simulated MIX-like computer whose word size is $N$, and instructions for $K$ such
machines acting in parallel can be simulated in $O(K + (N \log N)^2)$ steps if they are
first sorted so that all identical instructions are performed together. (Two instructions
are identical if they have the same operation code, the same register contents, and the
same memory operand contents.) Note that $N^2 = O(n^{12/13})$, so $(N \log N)^2 = O(K)$.]

13. [M25] (A. Schönhage.) What is a good upper bound on the time needed to
multiply an $m$-bit number by an $n$-bit number, when both $m$ and $n$ are very large but
$n$ is much larger than $m$, based on the results proved in this section for $m = n$?

14. [M42] Write a program for Algorithm C, incorporating the improvements of
exercise 4. Compare it with a program for Algorithm 4.3.1M and with a program
based on (2), to see how large $n$ must be before Algorithm C is an improvement.

15. [M49] (S. A. Cook.) A multiplication algorithm is said to be on line if the $(k + 1)$st
input bits of the operands, from right to left, are not read until the $k$th output bit
has been produced. What are the fastest possible on-line multiplication algorithms
achievable on various species of automata?

(The best upper bound known is $O(n(\log n)^2 \log\log n)$, due to M. J. Fischer and
L. J. Stockmeyer [J. Comp. and Syst. Sci. 9 (1974), 317-331]; their construction works
on multitape Turing machines, hence also on pointer machines. The best lower bound
known is of order $n \log n/\log\log n$, due to M. S. Paterson, M. J. Fischer, and A. R.
Meyer [SIAM/AMS Proceedings 7 (1974), 97-111]; this applies to multitape Turing
machines but not to pointer machines.)

4.4. RADIX CONVERSION

If OUR ANCESTORS had invented arithmetic by counting with their two fists 
or their eight fingers, instead of their ten “digits,” we would never have to worry 
about writing binary-decimal conversion routines. (And we would perhaps never 
have learned as much about number systems.) In this section, we shall discuss 
the conversion of numbers from positional notation with one radix into positional 
notation with another radix; this process is, of course, most important on binary 
computers when converting decimal input data into binary form, and converting 
binary answers into decimal form. 

A. The four basic methods. Binary-decimal conversion is one of the most 
machine-dependent operations of all, since computer designers keep inventing 
different ways to provide for it in the hardware. Therefore we shall discuss only 
the general principles involved, from which a programmer can select the procedure
that is best suited to his machine.

We shall assume that only nonnegative numbers enter into the conversion,
since the manipulation of signs is easily accounted for.

Let us assume that we are converting from radix $b$ to radix $B$. (The methods
can also be generalized to mixed-radix notations, as shown in exercises 1 and 2.)
Most radix-conversion routines are based on multiplication and division, using
one of the following four schemes:

1) Conversion of integers (radix point at the right).

• Method (1a) Division by $B$ (using radix-$b$ arithmetic). Given an integer number
$u$, we can obtain its radix-$B$ representation $(U_M \ldots U_1U_0)_B$ as follows:

$$\begin{aligned}
U_0 &= u \bmod B, \\
U_1 &= \lfloor u/B \rfloor \bmod B, \\
U_2 &= \lfloor \lfloor u/B \rfloor / B \rfloor \bmod B,
\end{aligned}$$

etc., stopping when $\lfloor \ldots \lfloor \lfloor u/B \rfloor / B \rfloor \ldots / B \rfloor = 0$.

• Method (1b) Multiplication by $b$ (using radix-$B$ arithmetic). If $u$ has the
radix-$b$ representation $(u_m \ldots u_1u_0)_b$, we can use radix-$B$ arithmetic to evaluate
the polynomial $u_m b^m + \cdots + u_1 b + u_0 = u$ in the form

$$((\ldots(u_m b + u_{m-1})b + \cdots)b + u_1)b + u_0.$$

2) Conversion of fractions (radix point at the left). Note that it is often impossible
to express a terminating radix-$b$ fraction $(0.u_{-1}u_{-2} \ldots u_{-m})_b$ exactly
as a terminating radix-$B$ fraction $(0.U_{-1}U_{-2} \ldots U_{-M})_B$. For example, the fraction
$\frac{1}{10}$ has the infinite binary representation $(0.0001100110011 \ldots)_2$. Therefore
methods of rounding the result to $M$ places are sometimes necessary.

• Method (2a) Multiplication by $B$ (using radix-$b$ arithmetic). Given a fractional
number $u$, we can obtain the digits of its radix-$B$ representation $(.U_{-1}U_{-2} \ldots)_B$
as follows:

$$\begin{aligned}
U_{-1} &= \lfloor uB \rfloor, \\
U_{-2} &= \lfloor \{uB\}B \rfloor, \\
U_{-3} &= \lfloor \{\{uB\}B\}B \rfloor, \quad \ldots,
\end{aligned}$$

where $\{x\}$ denotes $x \bmod 1 = x - \lfloor x \rfloor$. If it is desired to round the result to
$M$ places, the computation can be stopped after $U_{-M}$ has been calculated, and
$U_{-M}$ should be increased by unity if $\{\ldots\{\{uB\}B\} \ldots B\}$ is greater than $\frac{1}{2}$.
(Note, however, that this may cause carries to propagate, and these carries must
be incorporated into the answer using radix-$B$ arithmetic. It would be simpler
to add the constant $\frac{1}{2}B^{-M}$ to the original number $u$ before the calculation
begins, but this may lead to a terribly incorrect answer when $\frac{1}{2}B^{-M}$ cannot be
represented exactly as a radix-$b$ number inside the computer. Note further that
it is possible for the answer to round up to $(1.00 \ldots 0)_B$, if $b^m \ge 2B^M$.)

Exercise 3 shows how to extend this method so that M is variable , just large 
enough to represent the original number to a specified accuracy; in this case the 
problem of carries does not occur. 

• Method (2b) Division by $b$ (using radix-$B$ arithmetic). If $u$ has the radix-$b$
representation $(0.u_{-1}u_{-2} \ldots u_{-m})_b$, we can use radix-$B$ arithmetic to evaluate
$u_{-1}b^{-1} + u_{-2}b^{-2} + \cdots + u_{-m}b^{-m}$ in the form

$$((\ldots(u_{-m}/b + u_{1-m})/b + \cdots + u_{-2})/b + u_{-1})/b.$$

Care should be taken to control errors that might occur due to truncation or
rounding in the division by $b$; these are often negligible, but not always.

To summarize, Methods (1a), (1b), (2a), and (2b) give us two choices for a
conversion process, depending on whether our number is an integer or a fraction.
And it is certainly possible to convert between integers and fractions by multiplying
or dividing by an appropriate power of $b$ or $B$; therefore there are at least
four methods to choose from when trying to do a conversion.
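
The four schemes can be sketched compactly in Python. This is an illustration only: Python's built-in integers and `fractions.Fraction` stand in for "radix-$b$" or "radix-$B$" arithmetic, so the point about which radix the arithmetic is carried out in is not modeled, Method (2a) is shown with truncation instead of rounding, and all four function names are hypothetical.

```python
from fractions import Fraction


def int_to_digits(u, B):
    """Method (1a): repeated division by B; most significant digit first."""
    digits = []
    while True:
        u, d = divmod(u, B)
        digits.append(d)
        if u == 0:
            return digits[::-1]


def digits_to_int(digits, b):
    """Method (1b): evaluate the polynomial in b by nested multiplication."""
    u = 0
    for d in digits:
        u = u * b + d
    return u


def frac_to_digits(u, B, M):
    """Method (2a): repeated multiplication by B, truncated after M places."""
    digits = []
    for _ in range(M):
        u *= B
        d = u.numerator // u.denominator   # floor(u), since u >= 0
        digits.append(d)
        u -= d                             # keep the fractional part {u}
    return digits


def digits_to_frac(digits, b):
    """Method (2b): nested division by b, starting from the last digit."""
    u = Fraction(0)
    for d in reversed(digits):
        u = (u + d) / b
    return u


tenth = frac_to_digits(Fraction(1, 10), 2, 13)
```

For instance, `tenth` holds the first thirteen binary places of $\frac{1}{10}$, matching the nonterminating expansion mentioned above.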

B. Single-precision conversion. To illustrate these four methods, let us assume
that MIX is a binary computer, and suppose that we want to convert a binary
integer $u$ to a decimal integer. Thus $b = 2$ and $B = 10$. Method (1a) could be
programmed as follows:

        ENT1 0           Set j ← 0.
        LDX  U
        ENTA 0           Set rAX ← u.
1H      DIV  =10=        (rA, rX) ← (⌊rAX/10⌋, rAX mod 10).
        STX  ANSWER,1    U_j ← rX.                                  (1)
        INC1 1           j ← j + 1.
        SRAX 5           rAX ← rA.
        JXP  1B          Repeat until result is zero. ∎

This requires $18M + 4$ cycles to obtain $M$ digits.

The above method uses division by 10; Method (2a) uses multiplication
by 10, so it might be a little faster. In order to use Method (2a), we must deal
with fractions, and this leads to an interesting situation. Let $w$ be the word size
of the computer, and assume that $u < 10^n < w$. With a single division we can
find $q$ and $r$, where

$$wu = 10^n q + r, \qquad 0 \le r < 10^n. \tag{2}$$

Now if we apply Method (2a) to the fraction $(q + 1)/w$, we will obtain the digits
of $u$ from left to right, in $n$ steps, since

$$\left\lfloor 10^n\,\frac{q + 1}{w} \right\rfloor = \left\lfloor u + \frac{10^n - r}{w} \right\rfloor = u. \tag{3}$$

(This idea is due to P. A. Samet, Software—Practice and Experience 1 (1971),
93-96.)
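
A small Python sketch may make the trick concrete. It assumes a word size of $w = 2^{30}$ (MIX regarded as a binary machine with five 6-bit bytes, which is an assumption of this sketch), and the function name `digits_left_to_right` is hypothetical. The fraction $(q + 1)/w$ is represented by its integer numerator.

```python
W = 2 ** 30   # assumed word size w


def digits_left_to_right(u, n, w=W):
    """Return the n decimal digits of u, most significant first."""
    assert 0 <= u < 10 ** n < w
    q, r = divmod(w * u, 10 ** n)   # relation (2): w*u = 10**n * q + r
    x = q + 1                       # numerator of the fraction (q + 1)/w
    digits = []
    for _ in range(n):
        x *= 10
        digits.append(x // w)       # integer part of 10x is the next digit
        x %= w                      # keep only the fractional part
    return digits


sample = digits_left_to_right(369, 5)
```

Each pass costs one multiplication by 10 and no division, which is the source of the possible speed advantage.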

Here is the corresponding MIX program:

        JOV  OFLO        Ensure overflow is off.
        LDA  U
        LDX  =10^n=      rAX ← wu + 10^n.
        DIV  =10^n=      rA ← q + 1, rX ← r.
        JOV  ERROR       Jump if u ≥ 10^n.
        ENT1 n-1         Set j ← n − 1.                             (4)
2H      MUL  =10=        Now imagine radix point at left, rA = x.
        STA  ANSWER,1    Set U_j ← ⌊10x⌋.
        SLAX 5           x ← {10x}.
        DEC1 1           j ← j − 1.
        J1NN 2B          Repeat for n > j ≥ 0. ∎

This slightly longer routine requires $16n + 19$ cycles, so it is a little faster than
program (1) if $n = M \ge 8$; when leading zeros are present, (1) will be faster.

Program (4) as it stands cannot be used to convert integers $u \ge 10^m$ when
$10^m < w < 10^{m+1}$, since we would need to take $n = m + 1$. In this case we
can obtain the leading digit of $u$ by computing $\lfloor u/10^m \rfloor$; then $u \bmod 10^m$ can be
converted as above with $n = m$.

The fact that the answer digits are obtained from left to right may be an 
advantage in some applications (e.g., when typing out the answer one digit at 
a time). Thus we see that a fractional method can be used for conversion of 
integers, although the use of inexact division makes a little bit of numerical 
analysis necessary. 

A modification of Method (1a) can be used to avoid division by 10, by
replacing it with two multiplications. It is worth mentioning this modification
here, because radix conversion is often done by small “satellite” computers that
have no division capability. If we let $x$ be an approximation to $\frac{1}{10}$, so that
$\frac{1}{10} < x < \frac{1}{10} + 1/w$, it is easy to prove (see exercise 7) that $\lfloor ux \rfloor = \lfloor u/10 \rfloor$ or
$\lfloor u/10 \rfloor + 1$, so long as $0 \le u < w$. Therefore, if we compute $u - 10\lfloor ux \rfloor$, we
will be able to determine the value of $\lfloor u/10 \rfloor$:

$$\lfloor u/10 \rfloor = \begin{cases} \lfloor ux \rfloor, & \text{if } u - 10\lfloor ux \rfloor \ge 0; \\ \lfloor ux \rfloor - 1, & \text{if } u - 10\lfloor ux \rfloor < 0. \end{cases} \tag{5}$$

This procedure simultaneously determines $u \bmod 10$. A MIX program for conversion
using this idea appears in exercise 8; it requires about 33 cycles per digit.

The procedure represented by (5) can be used effectively even if the computer
has no built-in multiplication instruction, since multiplication by 10 consists of
two shifts and one addition ($10 = 2^3 + 2$). Even the task of multiplication by
$\frac{1}{10}$ can be done by judiciously combining a few shifting and adding operations,
as explained in exercise 9.
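
Rule (5) can be checked with a short Python sketch. The word size $w = 2^{30}$ and the particular constant $x = (\lfloor w/10 \rfloor + 1)/w$ are assumptions made for illustration (any $x$ strictly between $\frac{1}{10}$ and $\frac{1}{10} + 1/w$ would do), and the function name `divmod10` is hypothetical.

```python
W = 2 ** 30            # assumed word size w
X = W // 10 + 1        # x = X/W satisfies 1/10 < x < 1/10 + 1/w, since 10 does not divide W


def divmod10(u):
    """Return (u // 10, u % 10) using two multiplications and no division by 10."""
    assert 0 <= u < W
    q = (u * X) // W   # floor(u*x): the high-order word of the product u*X
    r = u - 10 * q
    if r < 0:          # floor(u*x) overshot by one; correct it as in (5)
        q -= 1
        r += 10
    return q, r


pair = divmod10(8192)
```

The division by `W` is only a shift on a binary machine, so the two real multiplications are by `X` and by 10.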

Another way to convert from binary to decimal is to use Method (1b), but
to do this we need to simulate doubling in a decimal number system. This idea 
is generally most suitable for incorporation into computer hardware; however, it 
is possible to program the doubling process for decimal numbers, using binary 
addition, binary shifting, and binary extraction (“logical AND” on each bit in the 
register), as shown in the following table. 


Table 1 

DOUBLING A BINARY-CODED DECIMAL NUMBER 


Operation 



General form 




Example 


1. Given 


U1U2U3U4 

UsUeU 7 U8 

Uq U\qUiiU \ 2 

0011 0110 1001 = 

3 6 9 

number 










2. Add 3 to 


V\ V2V3 V4 

VsV6V 7 Vs 

V 9 ^10^11^12 

0110 1001 1100 


each digit 










3. Shift left 

Vi 

v 2 V3 V4 v$ 

V6 V 7 V 8 V 9 

^ 10^11 v 12 

0 

0 1101 0011 1000 


one 










4. Extract low 

Vi 

0 0 0 V5 

0 0 0 Vq 

0 

0 

0 

0 

0 0001 0001 0000 


bit 










5. Shift right 


0 Vi 0 0 

0 ^ 0 0 

0 

^9 

0 

0 

0000 0100 0100 


two 










6. Shift right 


0 V1V1 0 

0 V5V5 0 

0 



0 

0000 0110 0110 


one and add 










7. Add result 

* 

* * * * 

* * * * 

* 

* 

* 

0 

0 1101 1001 1110 


of step 3 










8. Subtract 6 

yi 

2/2 2/3 2/4 2/5 

Ve 2/7 2/s 2/9 

2 / 102 / 112/12 

0 

0 0111 0011 1000 = 

738 


from each 


This method changes each individual digit d into $((d + 3) \times 2 + 0) - 6 = 2d$
when $0 \le d \le 4$, and into $((d + 3) \times 2 + 6) - 6 = (2d - 10) + 2^4$ when
$5 \le d \le 9$; and that is just what is needed to double decimal numbers encoded
with 4 bits per digit.
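
The eight steps of Table 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the name bcd_double and the idea of passing the digit count explicitly are hypothetical conveniences, and Python's unbounded integers stand in for a machine register wide enough to hold the extra carry bit.

```python
def bcd_double(x: int, ndigits: int) -> int:
    """Double a packed BCD integer (4 bits per digit) by the steps of Table 1."""
    threes = int("3" * ndigits, 16)            # 0011 in every digit position
    sixes = int("6" * ndigits, 16)             # 0110 in every digit position
    # Bit 0 of every 4-bit group except the rightmost, plus the leading carry bit.
    carry_mask = int("1" * ndigits, 16) << 4

    v = x + threes            # step 2: add 3 to each digit
    s = v << 1                # step 3: shift left one
    e = s & carry_mask        # step 4: extract the bits that signal a decimal carry
    f = e >> 2                # step 5: shift right two
    g = f + (f >> 1)          # step 6: 0110 in each digit that produced a carry
    z = s + g                 # step 7: add the result of step 3
    return z - sixes          # step 8: subtract 6 from each digit


# The example of Table 1: 369 doubled is 738.
print(format(bcd_double(0x369, 3), "x"))  # prints 738
```

Writing the BCD operand as a hexadecimal literal (0x369) makes each hex digit coincide with one decimal digit, which keeps the example easy to compare with the table.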

Another related idea is to keep a table of the powers of two in decimal form, 
and to add the appropriate powers together by simulating decimal addition. A 
survey of bit-manipulation techniques appears in Section 7.1. 

Finally, even Method (2b) can be used for the conversion of binary integers
to decimal integers. We can find q as in (2), and then we can simulate the decimal
division of q + 1 by w, using a “halving” process (exercise 10) that is similar
to the doubling process just described, retaining only the first n digits to the 
right of the radix point in the answer. In this situation, Method (2b) does not 
seem to offer advantages over the other three methods already discussed, but we 
have confirmed the remark made earlier that at least four distinct methods are 
available for converting integers from one radix to another. 

Now let us consider decimal-to-binary conversion (so that b = 10, B = 2).
Method (1a) simulates a decimal division by 2; this is feasible (see exercise 10),
but it is primarily suitable for incorporation in hardware instead of programs.

Method (1b) is the most practical method for decimal-to-binary conversion
in the great majority of cases. Here it is in MIX code, assuming that there are at
least two digits in the number $(u_m \ldots u_1 u_0)_{10}$ being converted:

        ENT1  M-1        Set j ← m − 1.
        LDA   INPUT+M    Set U ← u_m.
    1H  MUL   =10=
        SLAX  5
        ADD   INPUT,1    U ← 10U + u_j.                        (6)
        DEC1  1
        J1NN  1B         Repeat for m > j ≥ 0.

Note again that adding and shifting may be substituted for the multiplication 
by 10. 
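
As a rough illustration of program (6) with the multiplication replaced by shifts and an addition, here is a small Python sketch. The function names times10 and decimal_digits_to_int are hypothetical, and the digits are assumed to be supplied most significant first.

```python
def times10(u: int) -> int:
    """Compute 10u with two shifts and one addition, since 10 = 2**3 + 2."""
    return (u << 3) + (u << 1)


def decimal_digits_to_int(digits):
    """Method (1b): U <- 10U + u_j, scanning u_m, ..., u_1, u_0 from left to right."""
    U = digits[0]
    for d in digits[1:]:
        U = times10(U) + d
    return U


print(decimal_digits_to_int([1, 4, 1, 9, 8, 5, 7]))  # prints 1419857
```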

A trickier but perhaps faster method, which uses about lg m multiplications,
extractions, and additions instead of m multiplications and additions, is
described in exercise 19.

For the conversion of decimal fractions $(0.u_{-1}u_{-2} \ldots u_{-m})_{10}$ to binary
form, we can use Method (2b); or, more commonly, we can convert the integer
$(u_{-1}u_{-2} \ldots u_{-m})_{10}$ by Method (1b) and then divide by $10^m$.

C. Hand calculation. It is occasionally necessary for computer programmers to 
convert numbers by hand, and since this is a subject not yet taught in elementary 
schools, it may be worthwhile to examine it briefly here. There are very simple 
pencil-and-paper methods for converting between decimal and octal notations, 
and these methods are easily learned, so they ought to be more widely known. 

Converting octal integers to decimal. The simplest conversion is from octal to 
decimal; this technique was apparently first published by Walter Soden, Math. 
Comp. 7 (1953), 273-274. To do the conversion, write down the given octal number;
then at the kth step, double the k leading digits using decimal arithmetic,
and subtract this from the k + 1 leading digits using decimal arithmetic. The
process terminates in $n - 1$ steps if the given number has n digits. It is a good
idea to insert a radix point to show which digits are being doubled, as shown in 
the following example, in order to prevent embarrassing mistakes. 

Example 1. Convert $(5325121)_8$ to decimal.

      5.3 2 5 1 2 1
    − 1 0
      4 3.2 5 1 2 1
    −   8 6
      3 4 6.5 1 2 1
    −   6 9 2
      2 7 7 3.1 2 1
    −   5 5 4 6
      2 2 1 8 5.2 1
    −   4 4 3 7 0
      1 7 7 4 8 2.1
    −   3 5 4 9 6 4
      1 4 1 9 8 5 7        Answer: $(1419857)_{10}$.
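
The double-and-subtract procedure can be imitated mechanically. The following Python sketch is only an illustration; the name octal_to_decimal_by_hand is hypothetical, and an ordinary Python integer plays the role of the leading digits that a person would read in decimal on paper.

```python
def octal_to_decimal_by_hand(octal: str, trace: bool = False) -> int:
    """Convert an octal digit string to decimal by doubling and subtracting."""
    lead = int(octal[0])                    # the k leading digits, read in decimal
    for ch in octal[1:]:
        extended = lead * 10 + int(ch)      # the k + 1 leading digits, read in decimal
        lead = extended - 2 * lead          # subtract the doubled leading part
        if trace:
            print(lead)                     # intermediate rows of the worksheet
    return lead


print(octal_to_decimal_by_hand("5325121"))  # prints 1419857
```

With trace=True the function prints 43, 346, 2773, 22185, 177482, and 1419857, which are the leading parts of the successive rows of Example 1.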

A reasonably good check on the computations may be had by “casting out 
nines”: The sum of the digits of the decimal number must be congruent modulo 9 
to the alternating sum and difference of the digits of the octal number, with the 
rightmost digit of the latter given a plus sign. In the above example, we have 
1 + 4 + 1 + 9 + 8 + 5 + 7 = 35, and 1 − 2 + 1 − 5 + 2 − 3 + 5 = −1; the
difference is 36 (a multiple of 9). If this test fails, it can be applied to the k + 1
leading digits after the kth step, and the error can be located using a “binary
search” procedure; i.e., we start by checking the middle result, then use the same 
procedure on the first or second half of the calculation, depending on whether 
the middle result is incorrect or correct. 

The “casting-out-nines” process is only about 89 percent reliable, because 
there is one chance in nine that two random integers will differ by a multiple of 
nine. An even better check is to convert the answer back to octal by using an 
inverse method, which we shall now consider. 
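
The “casting out nines” test is easy to state in code. The sketch below is a hypothetical helper, not part of the text: it compares the decimal digit sum with the alternating sum of the octal digits, which works because 10 is congruent to 1 and 8 is congruent to −1 modulo 9.

```python
def nines_check(octal: str, decimal: int) -> bool:
    """Return True if the octal string and the decimal value agree modulo 9."""
    digit_sum = sum(int(c) for c in str(decimal))
    # The rightmost octal digit gets a plus sign, and the signs alternate leftward.
    alternating = sum(int(c) * (-1) ** i for i, c in enumerate(reversed(octal)))
    return (digit_sum - alternating) % 9 == 0


print(nines_check("5325121", 1419857))  # prints True
print(nines_check("5325121", 1419858))  # prints False
```

The test is necessary but not sufficient: an erroneous answer such as 1419866, which is off by exactly 9, would still pass.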

Converting decimal integers to octal. A similar procedure can be used for the 
opposite conversion: Write down the given decimal number; then at the kth step, 
double the k leading digits using octal arithmetic, and add these to the k + 1 
leading digits using octal arithmetic. The process terminates in $n - 1$ steps if
the given number has n digits. (See Example 2 below.)

The two procedures just given are essentially Method (1b) of the general
radix-conversion procedures. Doubling and subtracting in decimal notation is
like multiplying by 10 − 2 = 8; doubling and adding in octal notation is like
multiplying by 8 + 2 = 10. There is a similar method for hexadecimal/decimal
conversions, but it is a little more difficult since it involves multiplication by 6
instead of by 2.
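
The opposite direction can be sketched the same way. In this hypothetical helper the Python integer lead stands for the leading part that a person would write in octal; appending the next decimal digit to an octal numeral contributes 8·lead + digit, and adding the doubled leading part brings the total to 10·lead + digit. Only the final formatting step is explicitly octal, which is a simplification of the pencil-and-paper octal arithmetic.

```python
def decimal_to_octal_by_hand(decimal: str) -> str:
    """Convert a decimal digit string to octal by doubling and adding."""
    lead = int(decimal[0])                       # value of the leading part so far
    for ch in decimal[1:]:
        extended = 8 * lead + int(ch)            # leading part with next digit appended, as octal
        lead = extended + 2 * lead               # add the doubled leading part
    return format(lead, "o")


print(decimal_to_octal_by_hand("1419857"))  # prints 5325121
```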

To keep these two methods straight in our minds, it is not hard to remember
that we must subtract to go from octal to decimal, since the decimal
representation of a number is smaller; similarly we must add to go from decimal to
octal. The computations are performed using the radix of the answer, not the
radix of the given number, otherwise we couldn’t get the desired answer.

Example 2. Convert $(1419857)_{10}$ to octal.

      1.4 1 9 8 5 7
    +   2
      1 6.1 9 8 5 7
    +   3 4
      2 1 5.9 8 5 7
    +   4 3 2
      2 6 1 3.8 5 7
    +   5 4 2 6
      3 3 5 6 6.5 7
    +   6 7 3 5 4
      4 2 5 2 4 1.7
    + 1 0 5 2 5 0 2
      5 3 2 5 1 2 1        Answer: $(5325121)_8$.

(Note that the nonoctal digits 8 and 9 enter into this octal computation.)
The answer can be checked as discussed above. This method was published by
Charles P. Rozier, IEEE Trans. EC-11 (1962), 708-709.

Converting fractions. No equally fast method of converting fractions manually
is known; the best way seems to be Method (2a), with doubling and adding
or subtracting to simplify the multiplications by 10 or by 8. In this case, we
reverse the addition-subtraction criterion, adding when we convert to decimal
and subtracting when we convert to octal; we also use the radix of the given
input number, not the radix of the answer, in this computation (see Examples 3
and 4). The process is about twice as hard as the above method for integers.


Example 3. Convert $(.14159)_{10}$ to octal.

       .1 4 1 5 9
        2 8 3 1 8 −
      1.1 3 2 7 2
        2 6 5 4 4 −
      1.0 6 1 7 6
        1 2 3 5 2 −
      0.4 9 4 0 8
        9 8 8 1 6 −
      3.9 5 2 6 4
      1 9 0 5 2 8 −
      7.6 2 1 1 2
      1 2 4 2 2 4 −
      4.9 6 8 9 6        Answer: $(.110374\ldots)_8$.


Example 4. Convert $(.110374)_8$ to decimal.

       .1 1 0 3 7 4
        2 2 0 7 7 0 +
      1.3 2 4 7 3 0
        6 5 1 6 6 0 +
      4.1 2 1 1 6 0
        2 4 2 3 4 0 +
      1.4 5 4 1 4 0
      1 1 3 0 3 0 0 +
      5.6 7 1 7 0 0
      1 5 6 3 6 0 0 +
      8.5 0 2 6 0 0
      1 2 0 5 4 0 0 +
      6.2 3 3 4 0 0        Answer: $(.141586\ldots)_{10}$.
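
Examples 3 and 4 both amount to repeated multiplication by the target radix, peeling off the integer part each time. A minimal Python sketch follows; the function name convert_fraction is hypothetical, and exact rational arithmetic from the standard fractions module is used in place of the pencil-and-paper doubling trick.

```python
from fractions import Fraction


def convert_fraction(digits, b, B, count):
    """Convert the radix-b fraction (.d1 d2 ...)_b to its first `count` radix-B digits."""
    u = sum(Fraction(d, b ** (i + 1)) for i, d in enumerate(digits))
    out = []
    for _ in range(count):
        u *= B                                  # multiply by the new radix
        d = u.numerator // u.denominator        # the integer part is the next digit
        out.append(d)
        u -= d                                  # keep only the fraction part
    return out


print(convert_fraction([1, 4, 1, 5, 9], 10, 8, 6))     # prints [1, 1, 0, 3, 7, 4]
print(convert_fraction([1, 1, 0, 3, 7, 4], 8, 10, 6))  # prints [1, 4, 1, 5, 8, 6]
```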

D. Floating point conversion. When floating point values are to be converted, it is
necessary to deal with both the exponent and the fraction parts simultaneously,
since conversion of the exponent will affect the fraction part. Given the number
$f \cdot 2^e$ to be converted to decimal, we may express $2^e$ in the form $F \cdot 10^E$ (usually
by means of auxiliary tables), and then convert $Ff$ to decimal. Alternatively,
we can multiply e by $\log_{10} 2$ and round this to the nearest integer E; then divide
$f \cdot 2^e$ by $10^E$ and convert the result. Conversely, given the number $F \cdot 10^E$ to be
converted to binary, we may convert F and then multiply it by the floating point
number $10^E$ (again by using auxiliary tables). Obvious techniques can be used to
reduce the maximum size of the auxiliary tables by using several multiplications
and/or divisions, although this can cause rounding errors to propagate.
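
The second approach (choose E as the nearest integer to e·log10 2, then divide) can be sketched as follows. This is an illustration only: the function name binary_to_decimal_scale is hypothetical, and exact rationals replace the auxiliary tables and floating point multiplications a real routine would use, so the rounding-error issue mentioned above does not show up here.

```python
import math
from fractions import Fraction


def binary_to_decimal_scale(f: Fraction, e: int):
    """Rewrite f * 2**e as F * 10**E, with E the nearest integer to e * log10(2)."""
    E = round(e * math.log10(2))
    F = f * Fraction(2) ** e / Fraction(10) ** E
    return F, E


F, E = binary_to_decimal_scale(Fraction(3, 4), 100)
print(E, float(F))   # E is 30; F is a fraction of moderate size, ready for digit conversion
```

Because 2^e / 10^E lies between 10^(−1/2) and 10^(1/2), the fraction F stays within a factor of about 3.17 of f, so it can then be converted by the fixed point methods described earlier.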

E. Multiple-precision conversion. When converting extremely long numbers, it
is most convenient to start by converting blocks of digits, which can be handled
by single-precision techniques, and then to combine these blocks by using simple
multiple-precision techniques. For example, suppose that $10^n$ is the highest
power of 10 less than the computer word size. Then

a) To convert a multiple-precision integer from binary to decimal, divide it
repeatedly by $10^n$ (thus converting from binary to radix $10^n$ by Method (1a)).
Single-precision operations will give the n decimal digits for each place of the
radix-$10^n$ representation.

b) To convert a multiple-precision fraction from binary to decimal, proceed
similarly, multiplying by $10^n$ (i.e., using Method (2a) with $B = 10^n$).

c) To convert a multiple-precision integer from decimal to binary, convert
blocks of n digits first; then use Method (1b) to convert from radix $10^n$ to binary.

d) To convert a multiple-precision fraction from decimal to binary, convert first
to radix $10^n$ as in (c), then use Method (2b).
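
Case (a) can be sketched in Python as follows. The sketch assumes a 32-bit word and n = 9, since 10^9 is the largest power of 10 below 2^32; the names to_limbs and mp_binary_to_decimal are hypothetical, and a list of words (most significant first) stands in for the multiple-precision number.

```python
def to_limbs(value: int, word_bits: int = 32):
    """Split a nonnegative integer into words, most significant first."""
    limbs = []
    while value:
        limbs.append(value & ((1 << word_bits) - 1))
        value >>= word_bits
    return limbs[::-1] or [0]


def mp_binary_to_decimal(limbs, word_bits: int = 32, n: int = 9) -> str:
    """Repeatedly divide the multiword number by 10**n, as in Method (1a)."""
    base = 1 << word_bits
    chunk = 10 ** n
    limbs = list(limbs)
    blocks = []                                  # radix-10**n digits, least significant first
    while any(limbs):
        rem = 0
        for i, limb in enumerate(limbs):         # short division by a single word
            limbs[i], rem = divmod(rem * base + limb, chunk)
        blocks.append(rem)
    if not blocks:
        return "0"
    blocks.reverse()
    # Each block after the first contributes exactly n decimal digits.
    return str(blocks[0]) + "".join(f"{b:0{n}d}" for b in blocks[1:])


value = 2 ** 200 + 12345
print(mp_binary_to_decimal(to_limbs(value)) == str(value))  # prints True
```

Each pass over the words yields one radix-10^n digit, and the conversion of that digit to n decimal characters is a single-precision job; here Python's string formatting does that part.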

F. History and Bibliography. Radix-conversion techniques implicitly originated 
in ancient problems dealing with weights, measures, and currencies, where mixed- 
radix systems were generally involved; auxiliary tables were usually prepared to 
help make the conversions. During the seventeenth century, when sexagesimal 
fractions were being supplanted by decimal fractions, it was necessary to convert 
between the two systems in order to use existing books of astronomical tables; 
a systematic method to transform fractions from radix 60 to radix 10 and vice 
versa was given in the 1667 edition of William Oughtred’s Clavis Mathematicae,
Chapter 6, Section 18. (This material was not present in the original 1631
edition of Oughtred’s book.) Conversion rules had already been given by
al-Kashi of Persia in his Key to Arithmetic (c. 1414), where Methods (1a), (1b),
and (2a) are clearly explained [Istoriko-Mat. Issled. 7 (1954), 126-135], but his
work was unknown in Europe. The 18th century American mathematician Hugh 
Jones used the words “octavation” and “decimation” to describe octal/decimal 
conversions, but his methods were not as clever as his terminology. A. M. 
Legendre [Theorie des nombres (Paris: 1798), 229] noted that positive integers 
may be conveniently converted to binary form if they are repeatedly divided 
by 64. 

In 1946, H. H. Goldstine and J. von Neumann gave prominent consideration
to radix conversion in their classic memoir, “Planning and coding problems for
an electronic computing instrument,” because it was necessary to justify the use
of binary arithmetic; see John von Neumann, Collected Works 5 (New York:
Macmillan, 1963), 127-142. Another early discussion of radix conversion on
binary computers was published by F. Koons and S. Lubkin, Math. Comp. 3
(1949), 427-431, who suggested a rather unusual method. The first discussion
of floating point conversion was given somewhat later by F. L. Bauer and K.
Samelson [Zeit. für angewandte Math. und Physik 4 (1953), 312-316].

The following articles may be useful for further reference: A note by G. T. 
Lake [CACM 5 (1962), 468-469] mentions some hardware techniques for conversion
and gives clear examples. A. H. Stroud and D. Secrest [Comp. J. 6 (1963),
62-66] have discussed conversion of multiple-precision floating point numbers. 
The conversion of unnormalized floating point numbers, preserving the amount 
of “significance” implied by the representation, has been discussed by H. Kanner 
[JACM 12 (1965), 242-246] and by N. Metropolis and R. L. Ashenhurst [Math. 
Comp. 19 (1965), 435-441]. See also K. Sikdar, Sankhya (B) 30 (1968), 315-334, 
and the references cited in his paper. 


EXERCISES 

► 1. [25] Generalize Method (1b) so that it works with arbitrary mixed-radix
notations, converting

$a_m b_{m-1} \ldots b_1 b_0 + \cdots + a_1 b_0 + a_0$ into $A_M B_{M-1} \ldots B_1 B_0 + \cdots + A_1 B_0 + A_0$,

where $0 \le a_j < b_j$ and $0 \le A_J < B_J$ for $0 \le j < m$ and $0 \le J < M$.

Give an example of your generalization by manually converting the quantity
“3 days, 9 hours, 12 minutes, and 37 seconds” into long tons, hundredweights, stones,
pounds, and ounces. (Let one second equal one ounce. The British system of weights
has 1 stone = 14 pounds, 1 hundredweight = 8 stone, 1 long ton = 20 hundredweight.)
In other words, let $b_0 = 60$, $b_1 = 60$, $b_2 = 24$, $m = 3$, $B_0 = 16$, $B_1 = 14$, $B_2 = 8$,
$B_3 = 20$, $M = 4$; the problem is to find $A_4, \ldots, A_0$ in the proper ranges such that
$3b_2b_1b_0 + 9b_1b_0 + 12b_0 + 37 = A_4B_3B_2B_1B_0 + A_3B_2B_1B_0 + A_2B_1B_0 + A_1B_0 + A_0$,
using a systematic method that generalizes Method (1b). (All arithmetic is to be done
in a mixed-radix system.)

2. [25] Generalize Method (1a) so that it works with mixed-radix notations, as in
exercise 1, and give an example of your generalization by manually solving the same 
conversion problem stated in exercise 1. 

► 3. [25] (D. Taranto.) When fractions are being converted, there is no obvious way to
decide how many digits to give in the answer. Design a simple generalization of Method
(2a) that, given two positive radix-b fractions u and $\epsilon$ between 0 and 1, converts u to
a rounded radix-B equivalent U that has just enough places M to the right of the
radix point to ensure that $|U - u| < \epsilon$. (In particular if u is a multiple of $b^{-m}$ and
$\epsilon = b^{-m}/2$, the value of U will have just enough digits so that u can be recomputed
exactly, given U and m. Note that M might be zero; for example, if $\epsilon \le \frac12$ and
$u > 1 - \epsilon$, the proper answer is U = 1.)

4. [M21] (a) Prove that every real number with a terminating binary representation
also has a terminating decimal representation. (b) Find a simple condition on the
positive integers b and B that is satisfied if and only if every real number that has a
terminating radix-b representation also has a terminating radix-B representation.

5. [M20] Show that program (4) would still work if the instruction “LDX =10^n=”
were replaced by “LDX =c=” for certain other constants c.

6. [30] Discuss using Methods (1a), (1b), (2a), and (2b) when b or B is −2.

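7. [M18] Given that $0 < \alpha \le x \le \alpha + 1/w$ and $0 \le u \le w$, prove that $\lfloor ux \rfloor$ is
equal to either $\lfloor \alpha u \rfloor$ or $\lfloor \alpha u \rfloor + 1$. Furthermore $\lfloor ux \rfloor = \lfloor \alpha u \rfloor$ exactly, if $u < \alpha w$ and
$\alpha^{-1}$ is an integer.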

8. [24] Write a MIX program analogous to (1) that uses (5) and includes no division 
instructions. 

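9. [M27] Let u be an integer, $0 \le u < 2^{34}$. Assume that the following sequence of
operations (equivalent to addition and binary “shift-right” instructions) is performed:

$v \leftarrow \lfloor \frac12 u \rfloor, \quad v \leftarrow v + \lfloor \frac12 v \rfloor, \quad v \leftarrow v + \lfloor 2^{-4} v \rfloor,$

$v \leftarrow v + \lfloor 2^{-8} v \rfloor, \quad v \leftarrow v + \lfloor 2^{-16} v \rfloor, \quad v \leftarrow \lfloor \frac18 v \rfloor.$

Prove that $v = \lfloor u/10 \rfloor$ or $\lfloor u/10 \rfloor - 1$.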

10 . [22] The text shows how a binary-coded decimal number can be doubled by using 
various shifting, extracting, and addition operations on a binary computer. Give an 
analogous method that computes half of a binary-coded decimal number (throwing 
away the remainder if the number is odd). 

11. [16] Convert $(57721)_8$ to decimal.

► 12. [22] Invent a rapid pencil-and-paper method for converting integers from ternary
notation to decimal, and illustrate your method by converting $(1212011210210)_3$ into
decimal. How would you go from decimal to ternary?

► 13. [25] Assume that locations U + 1, U + 2, ..., U + m contain a multiple-precision
fraction $(.u_{-1}u_{-2} \ldots u_{-m})_b$, where b is the word size of MIX. Write a MIX routine that
converts this fraction to decimal notation, truncating it to 180 decimal digits. The
answer should be printed on two lines, with the digits grouped into 20 blocks of nine
each separated by blanks. (Use the CHAR instruction.)

► 14. [M27] (A. Schönhage.) The text’s method of converting multiple-precision
integers requires an execution time of order $n^2$ to convert an n-place integer, when n
is large. Show that it is possible to convert n-digit decimal integers into binary
notation in $O(M(n) \log n)$ steps, where M(n) is an upper bound on the number of steps
needed to multiply n-bit binary numbers that satisfies the “smoothness condition”
$M(2n) \ge 2M(n)$.

15. [M47] Can the upper bound on the time to convert large integers, given in
exercise 14, be substantially lowered? (Cf. exercise 4.3.3-12.)

16. [41] Construct a fast linear iterative array for radix conversion from decimal to
binary (cf. Section 4.3.3).

17. [M40] Design “ideal” floating point conversion subroutines, taking p-digit decimal
numbers into P-digit binary numbers and vice versa, in both cases producing a true
rounded result in the sense of Section 4.2.2.

18. [HM34] (David W. Matula.) Let $\mathrm{round}_b(u, p)$ be the function of b, u, and p that
represents the best p-digit base b floating point approximation to u, in the sense of
Section 4.2.2. Under the assumption that $\log_B b$ is irrational and that the range of
exponents is unlimited, prove that

$u = \mathrm{round}_b(\mathrm{round}_B(u, P), p)$

holds for all p-digit base b floating point numbers u if and only if $B^{P-1} \ge b^p$. (In
other words, an “ideal” input conversion of u into an independent base B, followed by
an “ideal” output conversion of this result, will always yield u again if and only if the
intermediate precision P is suitably large, as specified by the formula above.)

19. [M23] Let the decimal number $u = (u_7 \ldots u_1 u_0)_{10}$ be represented as the binary-coded
decimal number $U = (u_7 \ldots u_1 u_0)_{16}$. Find appropriate constants $c_i$ and masks
$m_i$ so that the operation $U \leftarrow U - c_i(U \wedge m_i)$, repeated for i = 1, 2, 3, will convert U
to the binary representation of u, where “$\wedge$” denotes extraction (i.e., “logical AND” on
individual bits).

4.5. RATIONAL ARITHMETIC

It is often important to know that the answer to some numerical problem
is exactly 1/3, not a floating point number that gets printed as “0.333333574”.
If arithmetic is done on fractions instead of on approximations to fractions,
many computations can be done entirely without any accumulated rounding
errors. This results in a comfortable feeling of security that is often lacking when
floating point calculations have been made, and it means that the accuracy of
the calculation cannot be improved upon.


4.5.1. Fractions 

When fractional arithmetic is desired, the numbers can be represented as
pairs of integers, (u/u'), where u and u' are relatively prime to each other and
u' > 0. The number zero is represented as (0/1). In this form, (u/u') = (v/v')
if and only if u = v and u' = v'.

Multiplication of fractions is, of course, easy; to form (u/u') × (v/v') =
(w/w'), we can simply compute uv and u'v'. The two products uv and u'v'
might not be relatively prime, but if d = gcd(uv, u'v'), the desired answer is
w = uv/d, w' = u'v'/d. (See exercise 2.) Efficient algorithms to compute the
greatest common divisor are discussed in Section 4.5.2.

Another way to perform the multiplication is to find $d_1 = \gcd(u, v')$ and
$d_2 = \gcd(u', v)$; then the answer is $w = (u/d_1)(v/d_2)$, $w' = (u'/d_2)(v'/d_1)$. (See
exercise 3.) This method requires two gcd calculations, but it is not really slower
than the former method; the gcd process involves a number of iterations that
is essentially proportional to the logarithm of its inputs, so the total number of
iterations needed to evaluate both $d_1$ and $d_2$ is essentially the same as the number
of iterations during the single calculation of d. Furthermore, each iteration in
the evaluation of $d_1$ and $d_2$ is potentially faster, because comparatively small
numbers are being examined. If u, u', v, and v' are single-precision quantities,
this method has the advantage that no double-precision numbers appear in the
calculation unless it is impossible to represent both of the answers w and w' in
single-precision form.
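
The second multiplication method can be sketched as follows. The function name frac_mul and the convention of passing the four integers separately are hypothetical; the inputs are assumed to be in lowest terms with positive denominators, as the text requires.

```python
from math import gcd


def frac_mul(u: int, up: int, v: int, vp: int):
    """Multiply (u/up) by (v/vp) using two small gcds instead of one large one."""
    d1 = gcd(u, vp)       # common factor of one numerator and the opposite denominator
    d2 = gcd(up, v)
    w = (u // d1) * (v // d2)
    wp = (up // d2) * (vp // d1)
    return w, wp          # already in lowest terms


print(frac_mul(3, 4, -10, 9))  # prints (-5, 6)
```

Cancelling before multiplying is what keeps the intermediate products within single precision whenever the final answer fits.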

Division may be done in a similar manner; see exercise 4. 

Addition and subtraction are slightly more complicated. The obvious procedure
is to set $(u/u') \pm (v/v') = ((uv' \pm u'v)/u'v')$ and then to reduce this
fraction to lowest terms by calculating $d = \gcd(uv' \pm u'v, u'v')$ as in the first
multiplication method. But again it is possible to avoid working with such large
numbers, if we start by calculating $d_1 = \gcd(u', v')$. If $d_1 = 1$, then the desired
numerator and denominator are $w = uv' \pm u'v$ and $w' = u'v'$. (According to
Theorem 4.5.2D, $d_1$ will be 1 about 61 percent of the time, if the denominators u'
and v' are randomly distributed, so it is wise to single this case out separately.)
If $d_1 > 1$, then let $t = u(v'/d_1) \pm v(u'/d_1)$ and calculate $d_2 = \gcd(t, d_1)$;
finally the answer is $w = t/d_2$, $w' = (u'/d_1)(v'/d_2)$. (Exercise 6 proves that
these values of w and w' are relatively prime to each other.) If single-precision
numbers are being used, this method requires only single-precision operations,
except that t may be a double-precision number or slightly larger (see exercise 7);
since $\gcd(t, d_1) = \gcd(t \bmod d_1, d_1)$, the calculation of $d_2$ does not require double
precision.

For example, to compute (7/66) + (17/12), we form $d_1 = \gcd(66, 12) = 6$;
then $t = 7 \cdot 2 + 17 \cdot 11 = 201$, and $d_2 = \gcd(201, 6) = 3$, so the answer is

$$\frac{201/3}{(66/6)(12/3)} = 67/44.$$
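
The addition procedure, including the special case $d_1 = 1$, might be coded along the following lines. The name frac_add is hypothetical; subtraction is obtained by negating v, and the inputs are assumed to be in lowest terms with positive denominators.

```python
from math import gcd


def frac_add(u: int, up: int, v: int, vp: int):
    """Add (u/up) and (v/vp), returning the sum in lowest terms."""
    d1 = gcd(up, vp)
    if d1 == 1:
        # No reduction is needed when the denominators are relatively prime.
        return u * vp + up * v, up * vp
    t = u * (vp // d1) + v * (up // d1)
    d2 = gcd(t, d1)       # same as gcd(t mod d1, d1), so t need not be reduced in full
    return t // d2, (up // d1) * (vp // d2)


print(frac_add(7, 66, 17, 12))  # prints (67, 44)
```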


To help check out subroutines for rational arithmetic, inversion of matrices 
with known inverses (e.g., Cauchy matrices, exercise 1.2.3-41) is suggested. 

Experience with fractional calculations shows that in many cases the numbers
grow to be quite large. So if u and u' are intended to be single-precision
numbers for each fraction (u/u'), it is important to include tests for overflow
in each of the addition, subtraction, multiplication, and division subroutines.
For numerical problems in which perfect accuracy is important, a set of
subroutines for fractional arithmetic with arbitrary precision allowed in numerator
and denominator is very useful.

The methods of this section extend also to other number fields besides the
rational numbers; for example, we could do arithmetic on quantities of the form
$(u + u'\sqrt{5})/u''$, where u, u', u'' are integers, gcd(u, u', u'') = 1, and u'' > 0; or
on quantities of the form $(u + u'\sqrt[3]{2} + u''\sqrt[3]{4})/u'''$, etc.

Instead of insisting on exact calculations with fractions, it is interesting to
consider also “fixed slash” and “floating slash” numbers, which are analogous to
floating point numbers but based on rational fractions instead of radix-oriented
fractions. In a binary fixed-slash scheme, the numerator and denominator of
a representable fraction each consist of at most p bits, for some given p. In a
floating-slash scheme, the sum of numerator bits plus denominator bits must be a
total of at most q, for some given q, and another field of the representation is used
to indicate how many of these q bits belong to the numerator. To do arithmetic
on such numbers, we define $x \oplus y = \mathrm{round}(x + y)$, $x \ominus y = \mathrm{round}(x - y)$,
etc., where round(x) = x if x is representable, otherwise it is one of the two
representable numbers that surround x.

It may seem at first that the best definition of round(x) would be to choose
the representable number that is closest to x, by analogy with the way we round
in floating point arithmetic. But experience has shown that it is best to bias our
rounding towards “simple” numbers, since numbers with small numerator and
denominator occur much more often than complicated fractions do. We want
more numbers to round to $\frac{1}{2}$ than to $\frac{127}{255}$. The rounding rule that turns out to
be most successful in practice is called “mediant rounding”: If (u/u') and (v/v')
are adjacent representable numbers, so that whenever $u/u' \le x \le v/v'$ we must
have round(x) equal to (u/u') or (v/v'), the mediant rounding rule says that

$$\mathrm{round}(x) = \frac{u}{u'} \text{ for } x < \frac{u+v}{u'+v'}, \qquad \mathrm{round}(x) = \frac{v}{v'} \text{ for } x > \frac{u+v}{u'+v'}. \qquad (1)$$

If $x = (u + v)/(u' + v')$ exactly, we let round(x) be the neighboring fraction
with the smallest denominator (or, if u' = v', with the smallest numerator).
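
Rule (1) together with its tie-breaking convention can be sketched in Python. This is a partial illustration only: the function mediant_round is hypothetical and assumes that the two adjacent representable neighbors of x have already been found by some other means, which is the harder part of a real slash-arithmetic package.

```python
from fractions import Fraction


def mediant_round(x: Fraction, lo, hi):
    """Round x to lo = (u, u') or hi = (v, v'), two adjacent representable fractions."""
    (u, up), (v, vp) = lo, hi
    mediant = Fraction(u + v, up + vp)
    if x < mediant:
        return lo
    if x > mediant:
        return hi
    # Tie: prefer the smallest denominator, then the smallest numerator.
    return min(lo, hi, key=lambda f: (f[1], f[0]))


# Neighbors (0/1) and (1/255): everything up to and including 1/256 rounds to 0.
print(mediant_round(Fraction(1, 300), (0, 1), (1, 255)))  # prints (0, 1)
print(mediant_round(Fraction(1, 256), (0, 1), (1, 255)))  # prints (0, 1)
print(mediant_round(Fraction(2, 511), (0, 1), (1, 255)))  # prints (1, 255)
```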

For example, suppose we are doing fixed slash arithmetic with p = 8, so that
the representable numbers (u/u') have $-128 < u < 128$ and $0 \le u' < 256$ and
gcd(u, u') = 1. This isn’t much precision, but it is enough to give us a feel for
slash arithmetic. The numbers adjacent to 0 = (0/1) are (−1/255) and (1/255);
according to the mediant rounding rule, we will therefore have round(x) = 0
if and only if $|x| \le 1/256$. Suppose we have a calculation that would take the
overall form $\frac{22}{7} = \frac{314}{159} + \frac{1300}{1113}$ if we were working in exact rational arithmetic, but
the intermediate quantities have had to be rounded to representable numbers. In
this case $\frac{314}{159}$ would round to (79/40) and $\frac{1300}{1113}$ would round to (7/6). The sum
$\frac{79}{40} + \frac{7}{6} = \frac{377}{120}$ rounds to (22/7), so we have obtained the correct answer even
though three roundings were required. This example was not specially contrived;
when the answer to a problem is a simple fraction, slash arithmetic tends to make
the intermediate rounding errors cancel out.

Exact representation of fractions within a computer was first discussed in 
the literature by P. Henrici, JACM 3 (1956), 6-9. Fixed and floating slash 
arithmetic was proposed by David W. Matula, in Applications of Number Theory 
to Numerical Analysis, ed. by S. K. Zaremba (New York: Academic Press,
1972), 486-489. Further developments of the idea are discussed by Matula and
Kornerup in Proc. IEEE Symp. Computer Arith. 4 (1978), 29-47; Lecture Notes
in Comp. Sci. 72 (1979), 383-397; Computing, Suppl. 2 (1980), 85-111.


EXERCISES 

1. [15] Suggest a reasonable computational method for comparing two fractions, to
test whether or not (u/u') < (v/v').

2. [M15] Prove that if d = gcd(u, v) then u/d and v/d are relatively prime.

3. [M20] Prove that if u and u' are relatively prime, and if v and v' are relatively
prime, then gcd(uv, u'v') = gcd(u, v') gcd(u', v).

4. [11] Design a division algorithm for fractions, analogous to the second
multiplication method of the text. (Note that the sign of v must be considered.)

5. [10] Compute (17/120) + (−27/70) by the method recommended in the text.

► 6. [M23] Show that if u, u' are relatively prime and if v, v' are relatively prime,
then $\gcd(uv' + vu', u'v') = d_1 d_2$, where $d_1 = \gcd(u', v')$ and $d_2 = \gcd(d_1, u(v'/d_1) + v(u'/d_1))$.
(Hence if $d_1 = 1$, then $uv' + vu'$ is relatively prime to $u'v'$.)

7. [M22] How large can the absolute value of the quantity t become, in the
addition-subtraction method recommended in the text, if the numerators and
denominators of the inputs are less than N in absolute value?

► 8. [22] Discuss using (1/0) and (−1/0) as representations for $\infty$ and $-\infty$, and/or
as representations of overflow.

9. [M23] If $1 \le u', v' < 2^n$, show that $\lfloor 2^{2n} u/u' \rfloor = \lfloor 2^{2n} v/v' \rfloor$ implies $u/u' = v/v'$.

10 . [41] Extend the subroutines suggested in exercise 4.3.1-34 so that they deal with 
“arbitrary” rational numbers. 

11. [M23] Consider fractions of the form $(u + u'\sqrt{5})/u''$, where u, u', u'' are integers,
gcd(u, u', u'') = 1, and u'' > 0. Explain how to divide two such fractions and to obtain
a quotient having the same form.

12 . [20] (Matula and Kornerup.) Discuss the representation of floating slash numbers 
in a 32-bit word. 

13. [M23] Explain how to compute the exact number of pairs of integers (u, u') such
that $1 \le u \le M$ and $N_1 < u' \le N_2$ and gcd(u, u') = 1. (This can be used
to determine how many numbers are representable in slash arithmetic. According to
Theorem 4.5.2D, the number will be approximately $(6/\pi^2)M(N_2 - N_1)$.)

14. [42] Modify one of the compilers at your installation so that it will replace all 
floating point calculations by floating slash calculations. Experiment with the use of 
slash arithmetic by running existing programs that were written by programmers who 
actually had floating point arithmetic in mind. (When special subroutines like square 
root or logarithm are called, your system should automatically convert slash numbers 
to floating point form before the subroutine is invoked, then back to slash form again 
afterwards. There should be a new option to print slash numbers in a fractional format; 
however, if you make no changes to a user’s source program, you probably will have 
to print slash numbers in decimal notation, in order to keep from messing up any 
column alignments.) Are the results better or worse, when floating slash numbers are 
substituted? 


4.5.2. The Greatest Common Divisor 

If u and v are integers, not both zero, we say that their greatest common divisor,
gcd(u, v), is the largest integer that evenly divides both u and v. This definition
makes sense, because if $u \ne 0$ then no integer greater than |u| can evenly
divide u, but the integer 1 does divide both u and v; hence there must be a
largest integer that divides them both. When u and v are both zero, every integer
evenly divides zero, so the above definition does not apply; it is convenient to
set

    gcd(0, 0) = 0.                         (1)

The definitions just given obviously imply that

    gcd(u, v) = gcd(v, u),                 (2)
    gcd(u, v) = gcd(−u, v),                (3)
    gcd(u, 0) = |u|.                       (4)


In the previous section, we reduced the problem of expressing a rational 
number in “lowest terms” to the problem of finding the greatest common divisor 
of its numerator and denominator. Other applications of the greatest common
divisor have been mentioned for example in Sections 3.2.1.2, 3.3.3, 4.3.2, 4.3.3.
So the concept of gcd (u, v) is important and worthy of serious study. 

The least common multiple of two integers u and v, written lcm(u, v), is a 
related idea that is also important. It is defined to be the smallest positive integer 
that is a multiple of (i.e., evenly divisible by) both u and v; and lcm(0, 0) = 0.
The classical method for teaching children how to add fractions u/u' + v/v' is
to train them to find the “least common denominator,” which is lcm(u', v').

According to the “fundamental theorem of arithmetic” (proved in exercise
1.2.4-21), each positive integer u can be expressed in the form

    $u = 2^{u_2} 3^{u_3} 5^{u_5} 7^{u_7} 11^{u_{11}} \ldots = \prod_{p \text{ prime}} p^{u_p}$,                 (5)

where the exponents $u_2, u_3, \ldots$ are uniquely determined nonnegative integers,
and where all but a finite number of the exponents are zero. From this canonical
factorization of a positive integer, it is easy to discover one way to compute the
greatest common divisor of u and v: By (2), (3), and (4), we may assume that u
and v are positive integers, and if both of them have been canonically factored
into primes we have

    $\gcd(u, v) = \prod_{p \text{ prime}} p^{\min(u_p, v_p)}$,                 (6)

    $\mathrm{lcm}(u, v) = \prod_{p \text{ prime}} p^{\max(u_p, v_p)}$.                 (7)

Thus, for example, the greatest common divisor of $u = 7000 = 2^3 \cdot 5^3 \cdot 7$ and
$v = 4400 = 2^4 \cdot 5^2 \cdot 11$ is $2^{\min(3,4)} 5^{\min(3,2)} 7^{\min(1,0)} 11^{\min(0,1)} = 2^3 \cdot 5^2 = 200$.
The least common multiple of the same two numbers is $2^4 \cdot 5^3 \cdot 7 \cdot 11 = 154000$.
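
Formulas (6) and (7) translate directly into code, as the following Python sketch shows. The helper names factor and gcd_lcm are hypothetical, and trial division is used purely for illustration; it is practical only for small numbers, which is precisely why the factorization approach is of theoretical rather than computational interest.

```python
from collections import Counter


def factor(n: int) -> Counter:
    """Return the exponents of the canonical prime factorization of n > 0."""
    exps = Counter()
    p = 2
    while p * p <= n:
        while n % p == 0:
            exps[p] += 1
            n //= p
        p += 1
    if n > 1:
        exps[n] += 1      # whatever remains is a prime factor
    return exps


def gcd_lcm(u: int, v: int):
    """Compute gcd and lcm of positive integers via min and max of the exponents."""
    fu, fv = factor(u), factor(v)
    g = l = 1
    for p in set(fu) | set(fv):
        g *= p ** min(fu[p], fv[p])   # formula (6)
        l *= p ** max(fu[p], fv[p])   # formula (7)
    return g, l


print(gcd_lcm(7000, 4400))  # prints (200, 154000)
```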

From formulas (6) and (7) we can easily prove a number of basic identities
concerning the gcd and the lcm:

    gcd(u, v) w = gcd(uw, vw),    if w ≥ 0;    (8)
    lcm(u, v) w = lcm(uw, vw),    if w ≥ 0;    (9)
    u · v = gcd(u, v) · lcm(u, v),    if u, v ≥ 0;    (10)
    gcd(lcm(u, v), lcm(u, w)) = lcm(u, gcd(v, w));    (11)
    lcm(gcd(u, v), gcd(u, w)) = gcd(u, lcm(v, w)).    (12)

The latter two formulas are “distributive laws” analogous to the familiar identity
uv + uw = u(v + w). Equation (10) reduces the calculation of gcd(u, v) to the
calculation of lcm(u, v), and conversely.
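
As a sanity check, the identities numbered (8) through (12) can be tested exhaustively on small positive integers. In this sketch the function `lcm` is our own helper; it finds the least common multiple by direct search, deliberately without using (10), so that (10) itself is tested rather than assumed.

```python
from itertools import product
from math import gcd

def lcm(u, v):
    """Least common multiple by direct search (0 if either argument is 0)."""
    if u == 0 or v == 0:
        return 0
    step = max(u, v)
    m = step
    while m % u != 0 or m % v != 0:
        m += step
    return m

def identities_hold(u, v, w):
    """Check identities (8)-(12) for one triple of positive integers."""
    return (
        gcd(u, v) * w == gcd(u * w, v * w)                      # (8)
        and lcm(u, v) * w == lcm(u * w, v * w)                  # (9)
        and u * v == gcd(u, v) * lcm(u, v)                      # (10)
        and gcd(lcm(u, v), lcm(u, w)) == lcm(u, gcd(v, w))      # (11)
        and lcm(gcd(u, v), gcd(u, w)) == gcd(u, lcm(v, w))      # (12)
    )

print(all(identities_hold(u, v, w)
          for u, v, w in product(range(1, 13), repeat=3)))      # True
```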

Euclid’s algorithm. Although Eq. (6) is useful for theoretical purposes, it is
generally no help for calculating a greatest common divisor in practice, because
it requires that we first determine the factorization of u and v. There is no
known method for finding the prime factors of an integer very rapidly (see Section
4.5.4). But fortunately there is an efficient way to calculate the greatest common
divisor of two integers without factoring them, and, in fact, such a method
was discovered over 2250 years ago; this is “Euclid’s algorithm,” which we have
already examined in Sections 1.1 and 1.2.1.

Euclid’s algorithm is found in Book 7, Propositions 1 and 2 of his Elements
(c. 300 B.C.), but it probably wasn’t his own invention. Scholars believe that
the method was known up to 200 years earlier, at least in its subtractive form,
and it was almost certainly known to Eudoxus (c. 375 B.C.); cf. K. von Fritz,
Ann. Math. (2) 46 (1945), 242-264. We might call it the granddaddy of all algorithms,
because it is the oldest nontrivial algorithm that has survived to the
present day. (The chief rival for this honor is perhaps the ancient Egyptian 
method for multiplication, which was based on doubling and adding, and which 
forms the basis for efficient calculation of nth powers as explained in Section 
4.6.3. But the Egyptian manuscripts merely give examples that are not completely
systematic, and these examples were certainly not stated systematically;
the Egyptian method is therefore not quite deserving of the name “algorithm.” 
Several ancient Babylonian methods, for doing such things as solving special 
sets of quadratic equations in two variables, are also known. Genuine algorithms 
are involved in this case, not just special solutions to the equations for certain 
input parameters; for even though the Babylonians invariably presented each 
method in conjunction with an example worked with particular input data, they 
regularly explained the general procedure in the accompanying text. [See D. E. 
Knuth, CACM 15 (1972), 671-677; 19 (1976), 108.] Many of these Babylonian 
algorithms predate Euclid by 1500 years, and they are the earliest known instances
of written procedures for mathematics. But they do not have the stature
of Euclid’s algorithm, since they do not involve iteration and since they have 
been superseded by modern algebraic methods.) 

In view of the importance of Euclid’s algorithm, for historical as well as 
practical reasons, let us now consider how Euclid himself treated it. Paraphrasing 
his words into modern terminology, this is essentially what he wrote: 

Proposition. Given two positive integers, find their greatest common divisor. 

Let A, C be the two given positive integers; it is required to find their greatest 
common divisor. If C divides A, then C is a common divisor of C and A, since it 
also divides itself. And it clearly is in fact the greatest, since no greater number 
than C will divide C. 

But if C does not divide A, then continually subtract the lesser of the numbers 
A, C from the greater, until some number is left that divides the previous one. 
This will eventually happen, for if unity is left, it will divide the previous number. 

Now let E be the positive remainder of A divided by C; let F be the positive
remainder of C divided by E; and let F be a divisor of E. Since F divides E and
E divides C − F, F also divides C − F; but it also divides itself, so it divides
C. And C divides A − E; therefore F also divides A − E. But it also divides E;
therefore it divides A. Hence it is a common divisor of A and C.

I now claim that it is also the greatest. For if F is not the greatest common divisor 
of A and C, some larger number will divide them both. Let such a number be G. 

Now since G divides C while C divides A − E, G divides A − E. G also divides the
whole of A, so it divides the remainder E. But E divides C − F; therefore G also
divides C − F. And G also divides the whole of C, so it divides the remainder F;
that is, a greater number divides a smaller one. This is impossible.
Therefore no number greater than F will divide A and C, so F is their greatest
common divisor. 

Corollary. This argument makes it evident that any number dividing two numbers 
divides their greatest common divisor. Q.E.D. 

Note. Euclid’s statements have been simplified here in one nontrivial respect: 
Greek mathematicians did not regard unity as a “divisor” of another positive 
integer. Two positive integers were either both equal to unity, or they were 
relatively prime, or they had a greatest common divisor. In fact, unity was not 
even considered to be a “number,” and zero was of course nonexistent. These 
rather awkward conventions made it necessary for Euclid to duplicate much of 
his discussion, and he gave two separate propositions that are each essentially 
like the one appearing here. 

In his discussion, Euclid first suggests subtracting the smaller of the two 
current numbers from the larger, repeatedly, until we get two numbers where one 
is a multiple of the other. But in the proof he really relies on taking the remainder 
of one number divided by another; and since he has no simple concept of zero, 
he cannot speak of the remainder when one number divides the other. It is 
reasonable to say that he imagines each division (not the individual subtractions) 
as a single step of the algorithm, and hence an “authentic” rendition of his 
algorithm can be phrased as follows: 

Algorithm E (Original Euclidean algorithm). Given two integers A and C greater 
than unity, this algorithm finds their greatest common divisor. 

E1. [A divisible by C?] If C divides A, the algorithm terminates with C as the
answer.

E2. [Replace A by remainder.] If A mod C is equal to unity, the given numbers
were relatively prime, so the algorithm terminates. Otherwise replace the
pair of values (A, C) by (C, A mod C) and return to step E1. |
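
Steps E1 and E2 can be sketched in Python as follows. The function name and the convention of returning `None` for relatively prime inputs are our own choices, made to mirror the Greek view that unity is not a divisor; they are not part of the algorithm as stated.

```python
def euclid_original(a, c):
    """Algorithm E for integers a, c greater than unity.

    Returns the greatest common divisor, or None when the numbers are
    relatively prime (the case that ends in step E2 with remainder 1).
    """
    if a <= 1 or c <= 1:
        raise ValueError("Algorithm E expects integers greater than unity")
    while True:
        if a % c == 0:          # E1: C divides A
            return c
        r = a % c               # E2: take the remainder
        if r == 1:
            return None         # relatively prime
        a, c = c, r

print(euclid_original(40902, 24140))  # 34
print(euclid_original(35, 12))        # None
```

Note that no remainder of zero is ever carried forward: the divisibility test in E1 catches that case one step early, just as Euclid's wording requires.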

The “proof” Euclid gave, which is quoted above, is especially interesting
because it is not really a proof at all! He verifies the result of the algorithm only
if step E1 is performed once or thrice. Surely he must have realized that step E1
could take place more than three times, although he made no mention of such 
a possibility. Not having the notion of a proof by mathematical induction, he 
could only give a proof for a finite number of cases. (In fact, he often proved 
only the case n = 3 of a theorem that he wanted to establish for general n.) 
Although Euclid is justly famous for the great advances he made in the art 
of logical deduction, techniques for giving valid proofs by induction were not 
discovered until many centuries later, and the crucial ideas for proving the 
validity of algorithms are only now becoming really clear. (See Section 1.2.1 
for a complete proof of Euclid’s algorithm, together with a short discussion of 
general proof procedures for algorithms.) 

It is worth noting that this algorithm for finding the greatest common divisor 
was chosen by Euclid to be the very first step in his development of the theory 
of numbers. The same order of presentation is still in use today in modern 
textbooks. Euclid also gave a method (Proposition 34) to find the least common
multiple of two integers u and v, namely to divide u by gcd(u, v) and to multiply
the result by v; this is equivalent to Eq. (10).

If we avoid Euclid’s bias against the numbers 0 and 1, we can reformulate 
Algorithm E in the following way. 

Algorithm A (Modern Euclidean algorithm). Given nonnegative integers u and v,
this algorithm finds their greatest common divisor. [Note: The greatest common
divisor of arbitrary integers u and v may be obtained by applying this algorithm
to |u| and |v|, because of Eqs. (2) and (3).]

A1. [v = 0?] If v = 0, the algorithm terminates with u as the answer.

A2. [Take u mod v.] Set r ← u mod v, u ← v, v ← r, and return to A1. (The
operations of this step decrease the value of v, but they leave gcd(u, v)
unchanged.) |

For example, we may calculate gcd(40902,24140) as follows: 

gcd(40902,24140) = gcd(24140,16762) = gcd(16762,7378) 

= gcd(7378,2006) = gcd(2006,1360) = gcd(1360,646) 

= gcd(646,68) = gcd(68,34) = gcd(34,0) = 34. 
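
The chain of equalities above is exactly what a direct transcription of steps A1 and A2 produces. In this Python sketch the optional `trace` argument is our own addition, used only to record the successive pairs (u, v).

```python
def euclid_gcd(u, v, trace=None):
    """Algorithm A for nonnegative integers u and v."""
    while True:
        if trace is not None:
            trace.append((u, v))
        if v == 0:              # A1: done when v = 0
            return u
        u, v = v, u % v         # A2: (u, v) <- (v, u mod v)

steps = []
print(euclid_gcd(40902, 24140, steps))  # 34
print(steps)
```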

A proof that Algorithm A is valid follows readily from Eq. (4) and the fact
that

    gcd(u, v) = gcd(v, u − qv),    (13)

if q is any integer. Equation (13) holds because any common divisor of u and v
is a divisor of both v and u − qv, and, conversely, any common divisor of v and
u − qv must divide both u and v.

The following MIX program illustrates the fact that Algorithm A can easily 
be implemented on a computer: 

Program A (Euclid’s algorithm). Assume that u and v are single-precision, 
nonnegative integers, stored respectively in locations U and V; this program puts 
gcd(u, v) into rA. 



            LDX   U       1        rX ← u.
            JMP   2F      1
    1H      STX   V       T        v ← rX.
            SRAX  5       T        rAX ← rA.
            DIV   V       T        rX ← rAX mod v.
    2H      LDA   V       1 + T    rA ← v.
            JXNZ  1B      1 + T    Done if rX = 0. |


The running time for this program is 19T + 6 cycles, where T is the number
of divisions performed. The discussion in Section 4.5.3 shows that we may take
T = 0.842766 ln N + 0.06 as an approximate average value, when u and v are
independently and uniformly distributed in the range 1 ≤ u, v ≤ N.
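
For small N the average number of divisions can simply be counted and set beside the estimate 0.842766 ln N + 0.06. The Python sketch below counts the division steps of Algorithm A over all pairs 1 ≤ u, v ≤ N; since the average is taken over all ordered pairs, it does not matter which operand is divided first. The function names are ours, and the estimate is an asymptotic one, so only rough agreement should be expected for small N.

```python
from math import log

def divisions(u, v):
    """Number of division steps Algorithm A performs on (u, v)."""
    t = 0
    while v != 0:
        u, v = v, u % v
        t += 1
    return t

def average_divisions(n):
    """Exact average of the division count over all pairs 1 <= u, v <= n."""
    total = sum(divisions(u, v)
                for u in range(1, n + 1) for v in range(1, n + 1))
    return total / (n * n)

n = 200
print(average_divisions(n), 0.842766 * log(n) + 0.06)
```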

Fig. 9. Binary algorithm for the greatest common divisor.

A binary method. Since Euclid’s patriarchal algorithm has been used for so many 
centuries, it is a rather surprising fact that it may not always be the best method 
for finding the greatest common divisor after all. A quite different gcd algorithm, 
which is primarily suited to binary arithmetic, was discovered by J. Stein in 1961 
[see J. Comp. Phys. 1 (1967), 397-405]. This new algorithm requires no division 
instruction; it relies solely on the operations of (i) subtraction, (ii) testing whether 
a number is even or odd, and (iii) shifting the binary representation of an even 
number to the right (halving). 

The binary gcd algorithm is based on four simple facts about positive integers 
u and v: 

a) If u and v are both even, then gcd(u, v) = 2 gcd(u/2, v/2). [See Eq. (8).]

b) If u is even and v is odd, then gcd(u, v) = gcd(u/2, v). [See Eq. (6).]

c) As in Euclid’s algorithm, gcd(u, v) = gcd(u − v, v). [See Eqs. (13), (2).]

d) If u and v are both odd, then u − v is even, and |u − v| < max(u, v).

These facts immediately suggest the following algorithm: 

Algorithm B (Binary gcd algorithm). Given positive integers u and v, this
algorithm finds their greatest common divisor.

B1. [Find power of 2.] Set k ← 0, and then repeatedly set k ← k + 1, u ← u/2,
v ← v/2, zero or more times until u and v are not both even.

B2. [Initialize.] (Now the original values of u and v have been divided by 2^k,
and at least one of their present values is odd.) If u is odd, set t ← −v and
go to B4. Otherwise set t ← u.

B3. [Halve t.] (At this point, t is even, and nonzero.) Set t ← t/2.

B4. [Is t even?] If t is even, go back to B3.

B5. [Reset max(u, v).] If t > 0, set u ← t; otherwise set v ← −t. (The larger of
u and v has been replaced by |t|, except perhaps during the first time this
step is performed.)

B6. [Subtract.] Set t ← u − v. If t ≠ 0, go back to B3. Otherwise the algorithm
terminates with u · 2^k as the output. |
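
A step-by-step transcription of B1 through B6 might look as follows in Python. This is a sketch for illustration only: Python's `%` and `//` stand in for the parity test and the right shift, and `math.gcd` is imported solely to cross-check the result.

```python
from math import gcd

def binary_gcd(u, v):
    """Algorithm B for positive integers u and v."""
    if u <= 0 or v <= 0:
        raise ValueError("Algorithm B expects positive integers")
    k = 0
    while u % 2 == 0 and v % 2 == 0:     # B1: remove common factors of 2
        u //= 2
        v //= 2
        k += 1
    t = -v if u % 2 == 1 else u          # B2: initialize t
    while t != 0:
        while t % 2 == 0:                # B3, B4: halve t until it is odd
            t //= 2                      # exact, since t is even here
        if t > 0:                        # B5: replace the larger of u, v
            u = t
        else:
            v = -t
        t = u - v                        # B6: subtract
    return u << k                        # u * 2^k

print(binary_gcd(40902, 24140))          # 34
print(all(binary_gcd(a, b) == gcd(a, b)
          for a in range(1, 60) for b in range(1, 60)))  # True
```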

As an example of Algorithm B, let us consider u = 40902, v = 24140, the
same numbers we have used to try out Euclid’s algorithm. Step B1 sets k ← 1,
u ← 20451, v ← 12070. Then t is set to −12070, and replaced by −6035; then
v is replaced by 6035, and the computation proceeds as follows:

        u        v      t
    20451     6035      +14416, +7208, +3604, +1802, +901;
      901     6035      −5134, −2567;
      901     2567      −1666, −833;
      901      833      +68, +34, +17;
       17      833      −816, −408, −204, −102, −51;
       17       51      −34, −17;
       17       17      0.

The answer is 17 · 2^1 = 34. A few more iterations were necessary here than
we needed with Algorithm A, but each iteration was somewhat simpler since no
division steps were used.

A MIX program for Algorithm B requires just a little more code than for 
Algorithm A. In order to make such a program fairly typical of a binary 
computer’s representation of Algorithm B, let us assume that MIX is extended to 
include the following operators: 

• SLB (shift left AX binary). C = 6; F = 6.
The contents of registers A and X are “shifted left” M binary places; that is,
|rAX| ← |2^M rAX| mod B^10, where B is the byte size. (As with all MIX shift
commands, the signs of rA and rX are not affected.)

• SRB (shift right AX binary). C = 6; F = 7.
The contents of registers A and X are “shifted right” M binary places; that is,
|rAX| ← ⌊|rAX| / 2^M⌋.

• JAE, JAO (jump A even, jump A odd). C = 40; F = 6, 7, respectively.
A JMP occurs if rA is even or odd, respectively.

• JXE, JXO (jump X even, jump X odd). C = 47; F = 6, 7, respectively.
Analogous to JAE, JAO.

Program B (Binary gcd algorithm). Assume that u and v are single-precision
positive integers, stored respectively in locations U and V; this program uses
Algorithm B to put gcd(u, v) into rA. Register assignments: t = rA, k = rI1.

    ABS   EQU   1:5
    B1    ENT1  0          1        B1. Find power of 2.
          LDX   U          1        rX ← u.
          LDAN  V          1        rA ← −v.
          JMP   1F         1
    2H    SRB   1          A        Halve rA, rX.
          INC1  1          A        k ← k + 1.
          STX   U          A        u ← u/2.
          STA   V(ABS)     A        v ← v/2.
    1H    JXO   B4         1 + A    To B4 with t ← −v if u is odd.
    B2    JAE   2B         B + A    B2. Initialize.
          LDA   U          B        t ← u.
    B3    SRB   1          D        B3. Halve t.

The running time of this program is

    9A + 2B + 6C + 3D + E + 13

units, where A = k, B = 1 if t ← u in step B2 (otherwise B = 0), C is the
number of subtraction steps, D is the number of halvings in step B3, and E is
the number of times t > 0 in step B5. Calculations discussed later in this section
imply that we may take A = 1/3, B = 1/3, C = 0.71n − 0.5, D = 1.41n − 2.7,
E = 0.35n − 0.4 as average values for these quantities, assuming random inputs
u and v in the range 1 ≤ u, v < 2^n. The total running time is therefore about
8.8n + 5 cycles, compared to about 11.1n + 7 for Program A under the same
assumptions. The worst possible running time for u and v in this range occurs
when A = 0, B = 0, C = n, D = 2n − 2, E = 1; this amounts to 12n + 8
cycles. (The corresponding value for Program A is 26.8n + 19.)
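
The arithmetic behind these totals is easy to reproduce. In the Python sketch below, the averages A = B = 1/3 are taken as an assumption on our part (they are the values that reproduce the stated total of about 8.8n + 5), and the figure for Program A uses 19T + 6 with T = 0.842766 ln N + 0.06 and N = 2^n. The function names are ours.

```python
from math import log

def program_b_cycles(A, B, C, D, E):
    """Running time 9A + 2B + 6C + 3D + E + 13 of Program B."""
    return 9 * A + 2 * B + 6 * C + 3 * D + E + 13

def program_b_average(n):
    # Assumed averages: A = B = 1/3, with C, D, E as quoted in the text.
    return program_b_cycles(1 / 3, 1 / 3,
                            0.71 * n - 0.5, 1.41 * n - 2.7, 0.35 * n - 0.4)

def program_b_worst(n):
    return program_b_cycles(0, 0, n, 2 * n - 2, 1)   # equals 12n + 8

def program_a_average(n):
    T = 0.842766 * log(2 ** n) + 0.06                # N = 2^n
    return 19 * T + 6

for n in (10, 20, 30):
    print(n, program_b_average(n), program_a_average(n), program_b_worst(n))
```

Expanding the first expression symbolically gives 8.84n + 5.17 (to two decimals), and the second gives about 11.10n + 7.14, in agreement with the rounded figures above.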

Thus the greater speed of the iterations in Program B, due to the simplicity 
of the operations, compensates for the greater number of iterations required. We 
have found that the binary algorithm is about 20 percent faster than Euclid’s 
algorithm on the MIX computer. Of course, the situation may be different 
on other computers, and in any event both programs are quite efficient; but 
it appears that not even a procedure as venerable as Euclid’s algorithm can 
withstand progress. 

V. C. Harris [Fibonacci Quarterly 8 (1970), 102-103] has suggested an interesting
cross between Euclid’s algorithm and the binary algorithm. If u and v are
odd, with u ≥ v > 0, we can always write u = qv ± r where 0 ≤ r < v and r
is even; if r ≠ 0 we set r ← r/2 until r is odd, then set u ← v, v ← r and repeat
the process. In subsequent iterations, q ≥ 3.
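
Harris's idea can be sketched as follows, assuming the decomposition reads u = qv ± r with r even: if the ordinary remainder u mod v is odd, we use v − (u mod v) instead, which is even because v is odd. The function name is ours, and `math.gcd` appears only as a cross-check.

```python
from math import gcd

def harris_gcd(u, v):
    """Harris's hybrid method for odd positive integers u and v."""
    if u <= 0 or v <= 0 or u % 2 == 0 or v % 2 == 0:
        raise ValueError("both arguments must be odd and positive")
    if u < v:
        u, v = v, u
    while True:
        r = u % v
        if r % 2 == 1:
            r = v - r            # u = (q + 1)v - r, and now r is even
        if r == 0:
            return v
        while r % 2 == 0:        # discard factors of 2; v is odd, so the
            r //= 2              # gcd is unaffected
        u, v = v, r

print(harris_gcd(20451, 6035))   # 17
print(all(harris_gcd(a, b) == gcd(a, b)
          for a in range(1, 100, 2) for b in range(1, 100, 2)))  # True
```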

Extensions. We can extend the methods used to calculate gcd(u, v) in order to
solve some slightly more difficult problems. For example, assume that we want
to compute the greatest common divisor of n integers u_1, u_2, ..., u_n.

One way to calculate gcd(u_1, u_2, ..., u_n), assuming that the u’s are all nonnegative,
is to extend Euclid’s algorithm in the following way: If all u_j are zero,
the greatest common divisor is taken to be zero; otherwise if only one u_j is
nonzero, it is the greatest common divisor; otherwise replace u_k by u_k mod u_j
for all k ≠ j, where u_j is the minimum of the nonzero u’s.

The algorithm sketched in the preceding paragraph is a natural generalization
of Euclid’s method, and it can be justified in a similar manner. But there
is a simpler method available, based on the easily verified identity

    gcd(u_1, u_2, ..., u_n) = gcd(u_1, gcd(u_2, ..., u_n)).    (14)

To calculate gcd(u_1, u_2, ..., u_n), we may therefore proceed as follows:

C1. Set d ← u_n, j ← n − 1.

C2. If d ≠ 1 and j > 0, set d ← gcd(u_j, d) and j ← j − 1 and repeat this step.
Otherwise d = gcd(u_1, ..., u_n).

This method reduces the calculation of gcd(u_1, ..., u_n) to repeated calculations
of the greatest common divisor of two numbers at a time. It makes use of the
fact that gcd(u_1, ..., u_j, 1) = 1; and this will be helpful, since we will already
have gcd(u_{n−1}, u_n) = 1 over 60 percent of the time if u_{n−1} and u_n are chosen
at random. In most cases, the value of d will decrease rapidly during the first few
stages of the calculation, and this will make the remainder of the computation
quite fast. Here Euclid’s algorithm has an advantage over Algorithm B, in that its
running time is primarily governed by the value of min(u, v), while the running
time for Algorithm B is primarily governed by max(u, v); it would be reasonable
to perform one iteration of Euclid’s algorithm, replacing u by u mod v if u is
much larger than v, and then to continue with Algorithm B.
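
Both ways of computing the gcd of n numbers can be sketched briefly in Python. The function names are ours; `gcd_many` follows steps C1 and C2 (with the list `us` holding u_1, ..., u_n at positions 0, ..., n − 1), and `gcd_many_by_reduction` follows the generalization of Euclid's method in which every entry is reduced modulo the smallest nonzero one.

```python
from math import gcd

def gcd_many(us):
    """Steps C1-C2: fold the list from the right, stopping early when d = 1."""
    if not us:
        raise ValueError("need at least one integer")
    d = us[-1]                       # C1: d <- u_n
    j = len(us) - 1                  #     j <- n - 1
    while d != 1 and j > 0:          # C2
        d = gcd(us[j - 1], d)        # u_j is stored at index j - 1
        j -= 1
    return d

def gcd_many_by_reduction(us):
    """Generalized Euclid: reduce every entry modulo the least nonzero one."""
    us = list(us)
    while True:
        nonzero = [x for x in us if x != 0]
        if not nonzero:
            return 0
        if len(nonzero) == 1:
            return nonzero[0]
        m = min(nonzero)
        j = us.index(m)
        us = [x if k == j else x % m for k, x in enumerate(us)]

print(gcd_many([40902, 24140, 51]))               # 17
print(gcd_many_by_reduction([40902, 24140, 51]))  # 17
```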

The assertion that gcd(u_{n−1}, u_n) will be equal to unity more than 60 percent
of the time for random inputs is a consequence of the following well-known result
of number theory:

Theorem D (G. Lejeune Dirichlet, Abhandlungen Königlich Preuß. Akad. Wiss.
(1849), 69-83). If u and v are integers chosen at random, the probability that
gcd(u, v) = 1 is 6/π^2 ≈ .60793.

A precise formulation of this theorem, which carefully defines what is meant 
by being “chosen at random,” appears in exercise 10 with a rigorous proof. Let 
us content ourselves here with a heuristic argument that shows why the theorem 
is plausible. 

If we assume, without proof, the existence of a well-defined probability p that
gcd(u, v) equals unity, then we can determine the probability that gcd(u, v) = d
for any positive integer d, because gcd(u, v) = d if and only if u is a multiple
of d and v is a multiple of d and gcd(u/d, v/d) = 1. Thus the probability that
gcd(u, v) = d is equal to 1/d times 1/d times p, namely p/d^2. Now let us sum
these probabilities over all possible values of d; we should get

    1 = \sum_{d \ge 1} p/d^2 = p (1 + 1/4 + 1/9 + 1/16 + \cdots).

Since the sum 1 + 1/4 + 1/9 + ··· is equal to π^2/6 (cf. Section 1.2.7), we
need p = 6/π^2 in order to make this equation come out right. |
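
The heuristic value can be compared with an exact count over a finite range. The Python sketch below interprets “chosen at random” as uniform over 1 ≤ u, v ≤ n, which is our assumption for the purpose of illustration; the function name is also ours.

```python
from math import gcd, pi

def coprime_fraction(n):
    """Fraction of pairs (u, v), 1 <= u, v <= n, with gcd(u, v) = 1."""
    hits = sum(1 for u in range(1, n + 1) for v in range(1, n + 1)
               if gcd(u, v) == 1)
    return hits / (n * n)

for n in (10, 100, 400):
    print(n, coprime_fraction(n), 6 / pi ** 2)
```

The counted fraction settles near 0.608 as n grows, in line with the theorem.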

Euclid’s algorithm can be extended in another important way: We can
calculate integers u' and v' such that

    uu' + vv' = gcd(u, v)    (15)

at the same time gcd(u, v) is being calculated. This extension of Euclid’s algorithm
can be described conveniently in vector notation:

Algorithm X (Extended Euclid’s algorithm). Given nonnegative integers u and v,
this algorithm determines a vector (u_1, u_2, u_3) such that u u_1 + v u_2 = u_3 =
gcd(u, v). The computation makes use of auxiliary vectors (v_1, v_2, v_3), (t_1, t_2, t_3);
all vectors are manipulated in such a way that the relations

    u t_1 + v t_2 = t_3,    u u_1 + v u_2 = u_3,    u v_1 + v v_2 = v_3    (16)

hold throughout the calculation.

X1. [Initialize.] Set (u_1, u_2, u_3) ← (1, 0, u), (v_1, v_2, v_3) ← (0, 1, v).

X2. [Is v_3 = 0?] If v_3 = 0, the algorithm terminates.

X3. [Divide, subtract.] Set q ← ⌊u_3/v_3⌋, and then set

    (t_1, t_2, t_3) ← (u_1, u_2, u_3) − (v_1, v_2, v_3) q,
    (u_1, u_2, u_3) ← (v_1, v_2, v_3),    (v_1, v_2, v_3) ← (t_1, t_2, t_3).

Return to step X2. |

For example, let u = 40902, v = 24140. At step X2 we have

    q     u_1     u_2     u_3     v_1     v_2     v_3
    -       1       0   40902       0       1   24140
    1       0       1   24140       1      −1   16762
    1       1      −1   16762      −1       2    7378
    2      −1       2    7378       3      −5    2006
    3       3      −5    2006     −10      17    1360
    1     −10      17    1360      13     −22     646
    2      13     −22     646     −36      61      68
    9     −36      61      68     337    −571      34
    2     337    −571      34    −710    1203       0

The solution is therefore 337 · 40902 − 571 · 24140 = 34 = gcd(40902, 24140).
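
Algorithm X translates almost line for line into Python, with tuple assignment playing the role of the temporary vector (t_1, t_2, t_3). The function name below is our own.

```python
def extended_euclid(u, v):
    """Algorithm X: return (u1, u2, u3) with u*u1 + v*u2 == u3 == gcd(u, v)."""
    u1, u2, u3 = 1, 0, u                     # X1
    v1, v2, v3 = 0, 1, v
    while v3 != 0:                           # X2
        q = u3 // v3                         # X3
        u1, u2, u3, v1, v2, v3 = (v1, v2, v3,
                                  u1 - q * v1, u2 - q * v2, u3 - q * v3)
    return u1, u2, u3

a, b, g = extended_euclid(40902, 24140)
print(a, b, g)                               # 337 -571 34
print(a * 40902 + b * 24140 == g)            # True
```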

The validity of Algorithm X follows from (16) and the fact that the algorithm
is identical to Algorithm A with respect to its manipulation of u_3 and v_3; a
detailed proof of Algorithm X is discussed in Section 1.2.1. Gordon H. Bradley
has observed that we can avoid a good deal of the calculation in Algorithm X
by suppressing u_2, v_2, and t_2; then u_2 can be determined afterwards using the
relation u u_1 + v u_2 = u_3.

Exercise 14 shows that the values of |u_1|, |u_2|, |v_1|, |v_2| remain bounded by
the size of the inputs u and v. Algorithm B, which computes the greatest common
divisor using properties of binary notation, can be extended in a similar way;
see exercise 35. For some instructive extensions to Algorithm X, see exercises 18
and 19 in Section 4.6.1.

The ideas underlying Euclid’s algorithm can also be applied to find a general
solution in integers of any set of linear equations with integer coefficients. For
example, suppose that we want to find all integers w, x, y, z that satisfy the two
equations

    10w + 3x + 3y + 8z = 1,    (17)
     6w − 7x      − 5z = 2.    (18)

We can introduce a new variable

    ⌊10/3⌋w + ⌊3/3⌋x + ⌊3/3⌋y + ⌊8/3⌋z = 3w + x + y + 2z = t_1,

and use it to eliminate y; Eq. (17) becomes

    (10 mod 3)w + (3 mod 3)x + 3t_1 + (8 mod 3)z = w + 3t_1 + 2z = 1,    (19)

and Eq. (18) remains unchanged. The new equation (19) may be used to eliminate
w, and (18) becomes

    6(1 − 3t_1 − 2z) − 7x − 5z = 2;

that is,

    7x + 18t_1 + 17z = 4.    (20)

Now as before we introduce a new variable

    x + 2t_1 + 2z = t_2

and eliminate x from (20):

    7t_2 + 4t_1 + 3z = 4.    (21)

Another new variable can be introduced in the same fashion, in order to eliminate
the variable z, which has the smallest coefficient:

    2t_2 + t_1 + z = t_3.

Eliminating z from (21) yields

    t_2 + t_1 + 3t_3 = 4,    (22)

and this equation, finally, can be used to eliminate t_2. We are left with two
independent variables, t_1 and t_3; substituting back for the original variables, we
obtain the general solution

    w =  17 −  5t_1 − 14t_3,
    x =  20 −  5t_1 − 17t_3,
    y = −55 + 19t_1 + 45t_3,    (23)
    z =  −8 +   t_1 +  7t_3.







In other words, all integer solutions (w, x, y, z) to the original equations (17),
(18) are obtained from (23) by letting t_1 and t_3 independently run through all
integers.
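
A quick mechanical check of the general solution is reassuring. The Python sketch below reads equations (17) and (18) as 10w + 3x + 3y + 8z = 1 and 6w − 7x − 5z = 2, substitutes the parametric solution (23), and confirms that both equations hold for a range of integer parameters. The function names are ours.

```python
def solution(t1, t3):
    """The parametric solution (23) for integer parameters t1 and t3."""
    w = 17 - 5 * t1 - 14 * t3
    x = 20 - 5 * t1 - 17 * t3
    y = -55 + 19 * t1 + 45 * t3
    z = -8 + t1 + 7 * t3
    return w, x, y, z

def satisfies(w, x, y, z):
    """Check equations (17) and (18)."""
    return (10 * w + 3 * x + 3 * y + 8 * z == 1
            and 6 * w - 7 * x - 5 * z == 2)

print(solution(0, 0))                                   # (17, 20, -55, -8)
print(all(satisfies(*solution(t1, t3))
          for t1 in range(-20, 21) for t3 in range(-20, 21)))  # True
```

Such a check confirms only that every parameter choice yields a solution; that no integer solutions are missed follows from the elimination steps themselves.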

The general method that has just been illustrated is based on the following
procedure: Find a nonzero coefficient c of smallest absolute value in the system
of equations. Suppose that this coefficient appears in an equation having the
form

    c x_0 + c_1 x_1 + ··· + c_k x_k = d;    (24)

assume for simplicity that c > 0. If c = 1, use this equation to eliminate
the variable x_0 from the other equations remaining in the system; then repeat
the procedure on the remaining equations. (If no more equations remain, the
computation stops, and a general solution in terms of the variables not yet
eliminated has essentially been obtained.) If c > 1, then if c_1 mod c = ··· =
c_k mod c = 0, check that d mod c = 0, otherwise there is no integer solution; then
divide both sides of (24) by c and eliminate x_0 as in the case c = 1. Finally,
if c > 1 and not all of c_1 mod c, ..., c_k mod c are zero, then introduce a new
variable

    ⌊c/c⌋x_0 + ⌊c_1/c⌋x_1 + ··· + ⌊c_k/c⌋x_k = t;    (25)

eliminate the variable x_0 from the other equations, in favor of t, and replace the
original equation (24) by

    ct + (c_1 mod c)x_1 + ··· + (c_k mod c)x_k = d.    (26)

(Cf. (19) and (21) in the above example.)

This process must terminate, since each step reduces either the number of 
equations or the size of the smallest nonzero coefficient in the system. A study of 
the above procedure will reveal its intimate connection with Euclid’s algorithm. 
The method is a comparatively simple means of solving linear equations when 
the variables are required to take on integer values only. It isn’t the best available 
method for this problem, however; substantial refinements are possible, but 
beyond the scope of this book. 

High-precision calculation. If u and v are very large integers, requiring a multiple- 
precision representation, the binary method (Algorithm B) is a simple and fairly 
efficient means of calculating their greatest common divisor, since it involves 
only subtractions and shifting. 

By contrast, Euclid’s algorithm seems much less attractive, since step A2
requires a multiple-precision division of u by v. But this difficulty is not really
as bad as it seems, since we will prove in Section 4.5.3 that the quotient ⌊u/v⌋
is almost always very small; for example, assuming random inputs, the quotient
⌊u/v⌋ will be less than 1000 approximately 99.856 percent of the time. Therefore
it is almost always possible to find ⌊u/v⌋ and (u mod v) using single-precision
calculations, together with the comparatively simple operation of calculating
u − qv where q is a single-precision number. Furthermore, if it does turn out
that u is much larger than v (e.g., the initial input data might have this form),
we don’t really mind having a large quotient q, since Euclid’s algorithm makes
a great deal of progress when it replaces u by u mod v in such a case.

A significant improvement in the speed of Euclid’s algorithm when high-
precision numbers are involved can be achieved by using a method due to D. H.
Lehmer [AMM 45 (1938), 227-233]. Working only with the leading digits of large
numbers, it is possible to do most of the calculations with single-precision arithmetic,
and to make a substantial reduction in the number of multiple-precision
operations involved. We save a lot of time by doing a “virtual” calculation
instead of the actual one.

For example, let us consider the pair of eight-digit numbers u = 27182818,
v = 10000000, assuming that we are using a machine with only four-digit words.
Let u' = 2718, v' = 1001, u'' = 2719, v'' = 1000; then u'/v' and u''/v'' are
approximations to u/v, with

    u'/v' < u/v < u''/v''.    (27)

The ratio u/v determines the sequence of quotients obtained in Euclid’s algorithm.
If we carry out Euclid’s algorithm simultaneously on the single-precision
values (u', v') and (u'', v'') until we get a different quotient, it is not difficult to
see that the same sequence of quotients would have appeared to this point if
we had worked with the multiple-precision numbers (u, v). Thus, consider what
happens when Euclid’s algorithm is applied to (u', v') and to (u'', v''):


      u'      v'     q'       u''     v''    q''
    2718    1001      2      2719    1000      2
    1001     716      1      1000     719      1
     716     285      2       719     281      2
     285     146      1       281     157      1
     146     139      1       157     124      1
     139       7     19       124      33      3


The first five quotients are the same in both cases, so they must be the true ones.
But on the sixth step we find that q' ≠ q'', so the single-precision calculations
are suspended. We have gained the knowledge that the calculation would have
proceeded as follows if we had been working with the original multiple-precision
numbers:


    u                   v                   q
    u_0                 v_0                 2
    v_0                 u_0 − 2v_0          1
    u_0 − 2v_0          −u_0 + 3v_0         2
    −u_0 + 3v_0         3u_0 − 8v_0         1        (28)
    3u_0 − 8v_0         −4u_0 + 11v_0       1
    −4u_0 + 11v_0       7u_0 − 19v_0        ?


(The next quotient lies somewhere between 3 and 19.) No matter how many digits
are in u and v, the first five steps of Euclid’s algorithm would be the same as (28),
so long as (27) holds. We can therefore avoid the multiple-precision operations
of the first five steps, and replace them all by a multiple-precision calculation of
−4u_0 + 11v_0 and 7u_0 − 19v_0. In this case we obtain u = 1268728, v = 279726;
the calculation can now proceed in a similar manner with u' = 1268, v' = 280,
u'' = 1269, v'' = 279, etc. If we had a larger accumulator, more steps could be
done by single-precision calculations; our example showed that only five cycles
of Euclid’s algorithm were combined into one multiple step, but with (say) a
word size of 10 digits we could do about twelve cycles at a time. (Results proved
in Section 4.5.3 imply that the number of multiple-precision cycles that can be
replaced at each iteration is essentially proportional to the number of digits used
in the single-precision calculations.)

Lehmer’s method can be formulated as follows: 

Algorithm L (Euclid’s algorithm for large numbers). Let u, v be nonnegative
integers, with u ≥ v, represented in multiple precision. This algorithm computes
the greatest common divisor of u and v, making use of auxiliary single-precision
p-digit variables û, v̂, A, B, C, D, T, q, and auxiliary multiple-precision variables
t and w.

L1. [Initialize.] If v is small enough to be represented as a single-precision
value, calculate gcd(u, v) by Algorithm A and terminate the computation.
Otherwise, let û be the p leading digits of u, and let v̂ be the corresponding
digits of v; in other words, if radix-b notation is being used, û ← ⌊u/b^k⌋ and
v̂ ← ⌊v/b^k⌋, where k is as small as possible consistent with the condition
û < b^p.

Set A ← 1, B ← 0, C ← 0, D ← 1. (These variables represent the
coefficients in (28), where

    u = A u_0 + B v_0,    v = C u_0 + D v_0,    (29)

in the equivalent actions of Algorithm A on multiple-precision numbers. We
also have

    u' = û + B,    v' = v̂ + D,    u'' = û + A,    v'' = v̂ + C    (30)

in terms of the notation in the example worked above.)

L2. [Test quotient.] Set q ← ⌊(û + A)/(v̂ + C)⌋. If q ≠ ⌊(û + B)/(v̂ + D)⌋, go to
step L4. (This step tests if q' ≠ q'', in the notation of the above example.
Note that single-precision overflow can occur in special circumstances during
the computation in this step, but only when û = b^p − 1 and A = 1 or when
v̂ = b^p − 1 and D = 1; the conditions

    0 ≤ û + A ≤ b^p,    0 ≤ v̂ + C < b^p,
    0 ≤ û + B < b^p,    0 ≤ v̂ + D ≤ b^p    (31)

will always hold, because of (30). It is possible to have v̂ + C = 0 or
v̂ + D = 0, but not both simultaneously; therefore division by zero in this
step is taken to mean “Go directly to L4.”)

L3. [Emulate Euclid.] Set T ← A − qC, A ← C, C ← T, T ← B − qD,
B ← D, D ← T, T ← û − qv̂, û ← v̂, v̂ ← T, and go back to step L2.
(These single-precision calculations are the equivalent of multiple-precision
operations, as in (28), under the conventions of (29).)

L4. [Multiprecision step.] If B = 0, set t ← u mod v, u ← v, v ← t, using
multiple-precision division. (This happens only if the single-precision operations
cannot simulate any of the multiple-precision ones. It implies that
Euclid’s algorithm requires a very large quotient, and this is an extremely
rare occurrence.) Otherwise, set t ← Au, t ← t + Bv, w ← Cu, w ← w + Dv,
u ← t, v ← w (using straightforward multiple-precision operations). Go
back to step L1. |
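
The control flow of steps L1 through L4 can be sketched in Python. Python integers have unlimited precision, so “single precision” is only simulated here: the parameters `p` and `b` (word length in digits, and radix) are illustrative assumptions, chosen by default to match the four-digit decimal example above, and the function name is ours.

```python
def lehmer_gcd(u, v, p=4, b=10):
    """Sketch of Algorithm L with simulated p-digit radix-b single precision."""
    if u < v:
        u, v = v, u                          # the algorithm assumes u >= v
    limit = b ** p
    while v >= limit:                        # L1: v is still multiple-precision
        k = 0
        while u // b ** k >= limit:          # smallest k with u_hat < b^p
            k += 1
        u_hat, v_hat = u // b ** k, v // b ** k
        A, B, C, D = 1, 0, 0, 1
        while True:
            # L2: a zero divisor means "go directly to L4".
            if v_hat + C == 0 or v_hat + D == 0:
                break
            q = (u_hat + A) // (v_hat + C)
            if q != (u_hat + B) // (v_hat + D):
                break
            # L3: one simulated step of Euclid's algorithm.
            A, C = C, A - q * C
            B, D = D, B - q * D
            u_hat, v_hat = v_hat, u_hat - q * v_hat
        if B == 0:                           # L4: nothing could be simulated
            u, v = v, u % v
        else:                                # L4: apply the combined steps
            u, v = A * u + B * v, C * u + D * v
    while v != 0:                            # Algorithm A on the small remainder
        u, v = v, u % v
    return u

print(lehmer_gcd(27182818, 10000000))        # 2
```

On the worked example, the first pass through L4 replaces (u, v) by (−4u + 11v, 7u − 19v) = (1268728, 279726), as in the text.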

The values of A, B , C , D remain as single-precision numbers throughout 
this calculation, because of (31). 
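
The steps of Algorithm L can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the program the text has in mind: it uses radix 2, takes the "single-precision" size to be a hypothetical parameter `p_bits`, and relies on Python's built-in big integers for the multiple-precision operations. The function name `lehmer_gcd` is likewise an assumption made for this example.

```python
def lehmer_gcd(u, v, p_bits=32):
    """Sketch of Algorithm L for nonnegative integers (radix 2, p = p_bits bits)."""
    if u < v:
        u, v = v, u
    while v >= (1 << p_bits):            # L1: v is not yet single precision
        k = u.bit_length() - p_bits      # smallest k with u_hat < 2**p_bits
        u_hat, v_hat = u >> k, v >> k    # leading digits of u and v
        A, B, C, D = 1, 0, 0, 1
        while True:
            # L2: division by zero is taken to mean "go directly to L4"
            if v_hat + C == 0 or v_hat + D == 0:
                break
            q = (u_hat + A) // (v_hat + C)
            if q != (u_hat + B) // (v_hat + D):
                break
            # L3: emulate one step of Euclid on the single-precision values
            A, C = C, A - q * C
            B, D = D, B - q * D
            u_hat, v_hat = v_hat, u_hat - q * v_hat
        if B == 0:                       # L4: nothing could be simulated
            u, v = v, u % v
        else:                            # L4: apply the accumulated coefficients
            u, v = A * u + B * v, C * u + D * v
    while v:                             # finish with Algorithm A
        u, v = v, u % v
    return u
```

For instance, `lehmer_gcd(40902, 24140)` falls straight through to the single-precision loop, while inputs of several hundred bits exercise steps L2 to L4.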

Algorithm L requires a somewhat more complicated program than Algorithm B,
but with large numbers it will be faster on many computers. The
binary technique of Algorithm B can, however, be speeded up in a similar
way (see exercise 34), to the point where it continues to win. Algorithm L
has the advantage that it can readily be extended, as in Algorithm X (see
exercise 17); furthermore, it determines the sequence of quotients obtained in
Euclid’s algorithm, and this yields the regular continued fraction expansion of a
real number (see exercise 4.5.3-18).

Analysis of the binary algorithm. Let us conclude this section by studying the 
running time of Algorithm B, in order to justify the formulas stated earlier. 

An exact determination of the behavior of Algorithm B appears to be exceedingly
difficult to derive, but we can begin to study it by means of an approximate
model of its behavior. Suppose that $u$ and $v$ are odd numbers, with
$u > v$ and

$$\lfloor \lg u \rfloor = m, \qquad \lfloor \lg v \rfloor = n. \tag{32}$$

(Thus, $u$ is an $(m+1)$-bit number, and $v$ is an $(n+1)$-bit number.) Algorithm B
forms $u - v$ and shifts this quantity right until obtaining an odd number $u'$ that
replaces $u$. Under random conditions, we would expect to have $u' = (u - v)/2$
about one-half of the time, $u' = (u - v)/4$ about one-fourth of the time,
$u' = (u - v)/8$ about one-eighth of the time, and so on. We have

$$\lfloor \lg u' \rfloor = m - k - r, \tag{33}$$

where $k$ is the number of places that $u - v$ is shifted right, and where $r$ is
$\lfloor \lg u \rfloor - \lfloor \lg(u - v) \rfloor$, the number of bits lost at the left during the subtraction
of $v$ from $u$. Note that $r \le 1$ when $m \ge n + 2$, and $r \ge 1$ when $m = n$. For
simplicity, we will assume that $r = 0$ when $m \ne n$ and that $r = 1$ when $m = n$,
although this lower bound tends to make $u'$ seem larger than it usually is.

The approximate model we shall use to study Algorithm B is based solely
on the values $m = \lfloor \lg u \rfloor$ and $n = \lfloor \lg v \rfloor$ throughout the course of the algorithm,
not on the actual values of $u$ and $v$. Let us call this approximation a lattice-point
model, since we will say that we are “at the point $(m, n)$” when $\lfloor \lg u \rfloor = m$ and
$\lfloor \lg v \rfloor = n$. From point $(m, n)$ the algorithm takes us to $(m', n)$ if $u > v$, or to
$(m, n')$ if $u < v$, or terminates if $u = v$. For example, the calculation starting
with $u = 20451$ and $v = 6035$ begins at the point (14,12), then goes to (9,12),
(9,11), (9,9), (4,9), (4,5), (4,4), and terminates. In line with the comments
of the preceding paragraph, we will make the following assumptions about the
probability that we reach a given point just after point $(m, n)$:


    Case 1, m > n.                      Case 2, m < n.
    Next point    Probability           Next point    Probability
    (m-1, n)      1/2                   (m, n-1)      1/2
    (m-2, n)      1/4                   (m, n-2)      1/4
    ...           ...                   ...           ...
    (1, n)        (1/2)^(m-1)           (m, 1)        (1/2)^(n-1)
    (0, n)        (1/2)^(m-1)           (m, 0)        (1/2)^(n-1)

    Case 3, m = n > 0.
    Next point              Probability
    (m-2, n), (m, n-2)      1/4, each
    (m-3, n), (m, n-3)      1/8, each
    ...                     ...
    (0, n), (m, 0)          (1/2)^m, each
    terminate               (1/2)^(m-1)

For example, from point (5,3) the lattice-point model would go to points
(4,3), (3,3), (2,3), (1,3), (0,3) with the respective probabilities 1/2, 1/4, 1/8, 1/16,
1/16; from (4,4) it would go to (2,4), (1,4), (0,4), (4,2), (4,1), (4,0), or would
terminate, with the respective probabilities 1/4, 1/8, 1/16, 1/4, 1/8, 1/16, 1/8. When $m$ and
$n$ are both 0, the formulas above do not apply; the algorithm always terminates
in such a case, since $m = n = 0$ implies that $u = v = 1$.

The detailed calculations of exercise 18 show that this lattice-point model
is somewhat pessimistic. In fact, when $m \ge 3$ the actual probability that
Algorithm B goes from $(m, m)$ to one of the two points $(m-2, m)$ or $(m, m-2)$
is equal to $\frac18$, although we have assumed the value $\frac12$; the algorithm actually goes
from $(m, m)$ to $(m-3, m)$ or $(m, m-3)$ with probability $\frac{7}{32}$, not $\frac14$; and it actually
goes from $(m+1, m)$ to $(m, m)$ with probability $\frac18$, not $\frac12$. The probabilities in
the model are nearly correct when $|m - n|$ is large, but when $|m - n|$ is small
the model predicts slower convergence than is actually obtained. In spite of the
fact that our model is not a completely faithful representation of Algorithm B,
it has one great virtue: It can be completely analyzed! Furthermore, empirical
experiments with Algorithm B show that the behavior predicted by the lattice-point
model is analogous to the true behavior.

An analysis of the lattice-point model can be carried out by solving the
following rather complicated set of double recurrence relations:

$$\begin{aligned}
A_{mm} &= a + \tfrac12 A_{m(m-2)} + \cdots + \tfrac{1}{2^{m-1}} A_{m0} + \tfrac{b}{2^{m-1}}, &&\text{if } m \ge 1;\\
A_{mn} &= c + \tfrac12 A_{(m-1)n} + \cdots + \tfrac{1}{2^{m-1}} A_{1n} + \tfrac{1}{2^{m-1}} A_{0n}, &&\text{if } m > n \ge 0;\\
A_{mn} &= A_{nm}, &&\text{if } n > m \ge 0.
\end{aligned} \tag{34}$$

The problem is to solve for $A_{mn}$ in terms of $m$, $n$, and the parameters $a$, $b$,
$c$, and $A_{00}$. This is an interesting set of recurrence equations, which have an
interesting solution.
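
Both the lattice path of an actual run and the recurrence (34) are easy to experiment with. The following Python sketch is offered only as an illustration; the function names `lattice_path` and `lattice_model` are assumptions made for this example. The first function records the points $(\lfloor \lg u\rfloor, \lfloor \lg v\rfloor)$ visited by the subtract-and-shift loop for odd $u$ and $v$, and the second evaluates $A_{mn}$ directly from (34) in exact rational arithmetic.

```python
from fractions import Fraction
from functools import lru_cache

def lattice_path(u, v):
    """Points (floor(lg u), floor(lg v)) visited by Algorithm B's main loop; u, v odd."""
    path = [(u.bit_length() - 1, v.bit_length() - 1)]
    while u != v:
        t = abs(u - v)
        while t % 2 == 0:          # shift right until odd
            t //= 2
        if u > v:
            u = t
        else:
            v = t
        path.append((u.bit_length() - 1, v.bit_length() - 1))
    return path

def lattice_model(a, b, c, a00):
    """Return a function A(m, n) satisfying recurrence (34) with the given parameters."""
    @lru_cache(maxsize=None)
    def A(m, n):
        if m < n:                  # symmetry: A_mn = A_nm
            return A(n, m)
        if m == 0:                 # here m = n = 0
            return Fraction(a00)
        if m == n:                 # diagonal case, m >= 1
            total = Fraction(a) + Fraction(b) / 2 ** (m - 1)
            for k in range(2, m + 1):
                total += A(m, m - k) / 2 ** (k - 1)
            return total
        total = Fraction(c) + A(0, n) / 2 ** (m - 1)   # case m > n >= 0
        for k in range(1, m):
            total += A(m - k, n) / 2 ** k
        return total
    return A
```

With $a = 1$, $b = 0$, $c = 1$, $A_{00} = 1$ the function returned by `lattice_model` counts subtraction cycles in the model; for example, `lattice_model(1, 0, 1, 1)(2, 2)` is 9/4. The path produced by `lattice_path(20451, 6035)` is the one listed in the text, ending at (4,4).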

First we observe that if $0 \le n < m$,

$$\begin{aligned}
A_{(m+1)n} &= c + \sum_{1\le k\le m} 2^{-k} A_{(m+1-k)n} + 2^{-m} A_{0n}\\
&= c + \tfrac12 A_{mn} + \sum_{1\le k<m} 2^{-k-1} A_{(m-k)n} + 2^{-m} A_{0n}\\
&= c + \tfrac12 A_{mn} + \tfrac12 (A_{mn} - c)\\
&= \tfrac12 c + A_{mn}.
\end{aligned}$$

Hence $A_{(m+k)n} = \tfrac12 ck + A_{mn}$, by induction on $k$. In particular, since $A_{10} = c + A_{00}$, we have

$$A_{m0} = \tfrac12 c(m+1) + A_{00}, \qquad m > 0. \tag{35}$$

Now let $A_m = A_{mm}$. If $m > 0$, we have

$$\begin{aligned}
A_{(m+1)m} &= c + \sum_{1\le k\le m+1} 2^{-k} A_{(m+1-k)m} + 2^{-m-1} A_{0m}\\
&= c + \tfrac12 A_{mm} + \sum_{1\le k\le m} 2^{-k-1}\bigl(A_{(m-k)(m+1)} - c/2\bigr) + 2^{-m-1} A_{0m}\\
&= c + \tfrac12 A_m + \tfrac12\bigl(A_{(m+1)(m+1)} - a - 2^{-m} b\bigr) - \tfrac14 c\,(1 - 2^{-m}) + 2^{-m-1}\bigl(\tfrac12 c(m+1) + A_{00}\bigr)\\
&= \tfrac12 (A_m + A_{m+1}) + \tfrac34 c - \tfrac12 a + 2^{-m-1}(c - b + A_{00}) + m\,2^{-m-2} c.
\end{aligned} \tag{36}$$

Similar maneuvering, as shown in exercise 19, establishes the relation

$$A_{n+1} = \tfrac34 A_n + \tfrac14 A_{n-1} + \alpha + 2^{-n-1}\beta + (n+2)\,2^{-n-1}\gamma, \qquad n \ge 2, \tag{37}$$

where $\alpha = \tfrac14 a + \tfrac78 c$, $\beta = A_{00} - b - \tfrac32 c$, and $\gamma = \tfrac12 c$.

Thus the double recurrence (34) can be transformed into the single recurrence
relation in (37). Use of the generating function $G(z) = A_0 + A_1 z + A_2 z^2 + \cdots$
now transforms (37) into the equation

$$\bigl(1 - \tfrac34 z - \tfrac14 z^2\bigr)\,G(z) = a_0 + a_1 z + a_2 z^2 + \frac{\alpha}{1-z} + \frac{\beta}{1 - z/2} + \frac{\gamma}{(1 - z/2)^2}, \tag{38}$$

where $a_0$, $a_1$, and $a_2$ are constants that can be determined by the values of $A_0$,
$A_1$, and $A_2$. Since $1 - \tfrac34 z - \tfrac14 z^2 = (1 + \tfrac14 z)(1 - z)$, we can express $G(z)$ by the
method of partial fractions in the form

$$G(z) = b_0 + b_1 z + \frac{b_2}{(1-z)^2} + \frac{b_3}{1-z} + \frac{b_4}{(1 - z/2)^2} + \frac{b_5}{1 - z/2} + \frac{b_6}{1 + z/4}.$$

Tedious manipulations produce the values of these constants $b_0, \ldots, b_6$, and thus
the coefficients of $G(z)$ are determined. We finally obtain the solution


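$$\begin{aligned}
A_{nn} ={}& n\bigl(\tfrac15 a + \tfrac{7}{10} c\bigr) + \bigl(\tfrac{16}{25} a + \tfrac25 b - \tfrac{23}{50} c + \tfrac35 A_{00}\bigr)\\
&+ 2^{-n}\bigl(-\tfrac13 cn + \tfrac23 b - \tfrac19 c - \tfrac23 A_{00}\bigr)\\
&+ \bigl(-\tfrac14\bigr)^n\bigl(-\tfrac{16}{25} a - \tfrac{16}{15} b + \tfrac{16}{225} c + \tfrac{16}{15} A_{00}\bigr) + \tfrac12 c\,\delta_{n0};\\
A_{mn} ={}& \tfrac12 mc + n\bigl(\tfrac15 a + \tfrac15 c\bigr) + \bigl(\tfrac{6}{25} a + \tfrac25 b + \tfrac{7}{50} c + \tfrac35 A_{00}\bigr) + 2^{-n}\bigl(\tfrac13 c\bigr)\\
&+ \bigl(-\tfrac14\bigr)^n\bigl(-\tfrac{6}{25} a - \tfrac25 b + \tfrac{2}{75} c + \tfrac25 A_{00}\bigr), \qquad m > n.
\end{aligned} \tag{39}$$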


With these elaborate calculations done, we can readily determine the behavior
of the lattice-point model. Assume that the inputs $u$ and $v$ to the algorithm
are odd, and let $m = \lfloor \lg u \rfloor$, $n = \lfloor \lg v \rfloor$. The average number of subtraction
cycles, namely the quantity $C$ in the analysis of Program B, is obtained by
setting $a = 1$, $b = 0$, $c = 1$, $A_{00} = 1$ in the recurrence (34). By (39) we see that
(for $m \ge n$) the lattice model predicts

$$C = \tfrac12 m + \tfrac25 n + \tfrac{49}{50} - \tfrac15 \delta_{mn} \tag{40}$$

subtraction cycles, plus terms that rapidly go to zero as $n$ approaches infinity.

The average number of times that $\gcd(u, v) = 1$ is obtained by setting $a =
b = c = 0$, $A_{00} = 1$; this gives the probability that $u$ and $v$ are relatively prime,
approximately $\frac35$. Actually, since $u$ and $v$ are assumed to be odd, they should
be relatively prime with probability $8/\pi^2$ (see exercise 13), so this reflects the
degree of inaccuracy of the lattice-point model.

The average number of times that a path from $(m, n)$ goes through one of
the “diagonal” points $(m', m')$ for some $m' \ge 1$ is obtained by setting $a = 1$,
$b = c = A_{00} = 0$ in (34); so we find that this quantity is approximately

$$\tfrac15 n + \tfrac{6}{25} + \tfrac25 \delta_{mn}, \qquad \text{when } m \ge n.$$


Now we can determine the average number of shifts, i.e., the number of times
that step B3 is performed. (This is the quantity $D$ in Program B.) In any
execution of Algorithm B, with $u$ and $v$ both odd, the corresponding path in the
lattice model must satisfy the relation

    number of shifts + number of diagonal points + 2⌊lg gcd(u, v)⌋ = m + n,

since we are assuming that $r$ in (33) is always either 0 or 1. The average value
of $\lfloor \lg \gcd(u, v) \rfloor$ predicted by the lattice-point model is approximately $\frac45$ (see
exercise 20). Therefore we have, for the total number of shifts,

$$\begin{aligned}
D &= m + n - \bigl(\tfrac15 n + \tfrac{6}{25} + \tfrac25 \delta_{mn}\bigr) - \tfrac85\\
&= m + \tfrac45 n - \tfrac{46}{25} - \tfrac25 \delta_{mn},
\end{aligned} \tag{41}$$

when $m \ge n$, plus terms that decrease rapidly to zero for large $n$.

Table 1
NUMBER OF SUBTRACTIONS IN ALGORITHM B

                 Predicted by model                     Actual average values
  m \ n     0     1     2     3     4     5        0     1     2     3     4     5
    0     1.00  2.00  2.50  3.00  3.50  4.00     1.00  2.00  2.50  3.00  3.50  4.00
    1     2.00  1.00  2.50  3.00  3.50  4.00     2.00  1.00  3.00  2.75  3.63  3.94
    2     2.50  2.50  2.25  3.38  3.88  4.38     2.50  3.00  2.00  3.50  3.88  4.25
    3     3.00  3.00  3.38  3.25  4.22  4.72     3.00  2.75  3.50  2.88  4.13  4.34
    4     3.50  3.50  3.88  4.22  4.25  5.10     3.50  3.63  3.88  4.13  3.94  4.80
    5     4.00  4.00  4.38  4.72  5.10  5.19     4.00  3.94  4.25  4.34  4.80  4.60

To summarize the most important facts we have derived from the lattice-point
model, we have shown that if $u$ and $v$ are odd and if $\lfloor \lg u \rfloor = m$, $\lfloor \lg v \rfloor = n$,
then the quantities $C$ and $D$ that are the critical factors in the running time of
Program B will have average values given by

$$C = \tfrac12 m + \tfrac25 n + O(1), \qquad D = m + \tfrac45 n + O(1), \qquad m \ge n. \tag{42}$$

But the model that we have used to derive (42) is only a pessimistic approximation
to the true behavior; Table 1 compares the true average values of $C$, computed
by actually running Algorithm B with all possible inputs, to the values
predicted by the lattice-point model, for small $m$ and $n$. The lattice model is
completely accurate when $m$ or $n$ is zero, but it tends to be less accurate when
$|m - n|$ is small and $\min(m, n)$ is large. When $m = n = 9$, the lattice-point
model gives $C = 8.78$, compared to the true value $C = 7.58$.
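
The "actual average values" for small $m$ and $n$ can be recomputed by brute force. The following Python sketch is an illustration only; the names `subtractions` and `average_C` are assumptions made for this example. It counts executions of the subtraction step B6 (including the final subtraction that yields zero) and averages over all odd $u$ with $\lfloor \lg u\rfloor = m$ and all odd $v$ with $\lfloor \lg v\rfloor = n$.

```python
from fractions import Fraction

def subtractions(u, v):
    """Number of times step B6 is executed by Algorithm B for odd u and v."""
    count = 0
    while True:
        count += 1                 # one execution of B6: t <- u - v
        if u == v:                 # t = 0, so the algorithm terminates
            return count
        t = abs(u - v)
        while t % 2 == 0:          # steps B3/B4: halve until odd
            t //= 2
        if u > v:                  # step B5: replace the larger of u, v
            u = t
        else:
            v = t

def average_C(m, n):
    """Exact average of C over odd u in [2^m, 2^(m+1)) and odd v in [2^n, 2^(n+1))."""
    us = range(2 ** m | 1, 2 ** (m + 1), 2)
    vs = range(2 ** n | 1, 2 ** (n + 1), 2)
    total = sum(subtractions(u, v) for u in us for v in vs)
    return Fraction(total, len(us) * len(vs))
```

For example, `average_C(3, 1)` evaluates to 11/4, in agreement with the entry 2.75 in the table.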

Empirical tests of Algorithm B with several million random inputs and with
various values of $m$, $n$ in the range $29 \le m, n \le 37$ indicate that the actual
average behavior of the algorithm is given by

$$\begin{aligned}
C &\approx \tfrac12 m + 0.203n + 1.9 - 0.4(0.6)^{m-n},\\
D &\approx m + 0.41n - 0.5 - 0.7(0.6)^{m-n},
\end{aligned} \qquad m \ge n. \tag{43}$$

These experiments showed a rather small standard deviation from the observed
average values. The coefficients $\frac12$ and 1 of $m$ in (42) and (43) can be verified
rigorously without using the lattice-point approximation (see exercise 21); so the
error in the lattice-point model is apparently in the coefficient of $n$, which is too
high.

The above calculations have been made under the assumption that $u$ and $v$
are odd and in the ranges $2^m < u < 2^{m+1}$ and $2^n < v < 2^{n+1}$. If we
assume instead that $u$ and $v$ are to be any integers, independently and uniformly
distributed over the ranges

$$1 \le u < 2^N, \qquad 1 \le v < 2^N,$$

then we can calculate the average values of $C$ and $D$ from the data already given;
in fact, if $C_{mn}$ denotes the average value of $C$ under our earlier assumptions,
exercise 22 shows that we have

$$\begin{aligned}
(2^N - 1)^2\, C ={}& N^2 C_{00} + 2N \sum_{1\le n\le N} (N-n)\,2^{n-1} C_{n0}\\
&+ 2 \sum_{1\le n<m\le N} (N-m)(N-n)\,2^{m+n-2} C_{mn}\\
&+ \sum_{1\le n\le N} (N-n)^2\, 2^{2n-2} C_{nn}.
\end{aligned} \tag{44}$$

The same formula holds for $D$ in terms of $D_{mn}$. If the indicated sums are carried
out using the approximations in (43), we obtain

$$C \approx 0.70N + O(1), \qquad D \approx 1.41N + O(1).$$

(See exercise 23.) This agrees perfectly with the results of further empirical tests,
made on several million random inputs for $N \le 30$; the latter tests show that
we may take

$$C \approx 0.70N - 0.5, \qquad D \approx 1.41N - 2.7 \tag{45}$$

as good estimates of the values, given this distribution of the inputs $u$ and $v$.

Richard Brent has discovered a continuous model that accounts for the
leading terms in (45). Let us assume that $u$ and $v$ are large, and that the number
of shifts per iteration has the value $d$ with probability exactly $2^{-d}$. If we let
$X = u/v$, the effect of steps B3-B5 is to replace $X$ by $(X - 1)/2^d$ if $X > 1$, or by
$2^d/(X^{-1} - 1)$ if $X < 1$. The random variable $X$ has a limiting distribution that
makes it possible to analyze the average value of the ratio by which $\max(u, v)$
decreases at each iteration; see exercise 25. Numerical calculations show that this
maximum decreases by $\beta = 0.705971246102$ bits per iteration; the agreement
with experiment is so good that Brent’s constant $\beta$ must be the true value of
the number “0.70” in (45), and we should replace 0.203 by 0.206 in (43). [See
Algorithms and Complexity, ed. by J. F. Traub (New York: Academic Press,
1976), 321-355.]

This completes our analysis of the average values of $C$ and $D$. The other
three quantities appearing in the running time of Algorithm B are rather easily
analyzed; see exercises 6, 7, and 8.

Thus we know approximately how Algorithm B behaves on the average. Let
us now consider a “worst case” analysis: What values of $u$ and $v$ are in some
sense the hardest to handle? If we assume as before that

$$\lfloor \lg u \rfloor = m \quad\text{and}\quad \lfloor \lg v \rfloor = n,$$

let us try to find $(u, v)$ that make the algorithm run most slowly. In view of
the fact that the subtractions take somewhat longer than the shifts, when the
auxiliary bookkeeping is considered, this question may be rephrased by asking
for the inputs $u$ and $v$ that require most subtractions. The answer is somewhat
surprising; the maximum value of $C$ is exactly

$$\max(m, n) + 1, \tag{46}$$

although the lattice-point model would predict that substantially higher values
of $C$ are possible (see exercise 26). The derivation of the worst case (46) is quite
interesting, so it has been left as an amusing problem for the reader to work out
by himself (see exercises 27 and 28).
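
For small $m$ and $n$ the bound (46) can be checked exhaustively. The Python sketch below is an illustration only; the names `subtractions` and `max_subtractions` are assumptions made for this example. It confirms the upper bound of exercise 28 on a small range and shows a few cases where the bound is attained; it is not a substitute for the proofs requested in exercises 27 and 28.

```python
def subtractions(u, v):
    """Number of times step B6 is executed by Algorithm B for odd u and v."""
    count = 0
    while True:
        count += 1                 # one execution of B6
        if u == v:
            return count
        t = abs(u - v)
        while t % 2 == 0:          # shift right until odd
            t //= 2
        if u > v:
            u = t
        else:
            v = t

def max_subtractions(m, n):
    """Largest subtraction count over odd u, v with floor(lg u) = m, floor(lg v) = n."""
    us = range(2 ** m | 1, 2 ** (m + 1), 2)
    vs = range(2 ** n | 1, 2 ** (n + 1), 2)
    return max(subtractions(u, v) for u in us for v in vs)

# Check that the count never exceeds max(m, n) + 1 for all 0 <= m, n <= 6.
bound_holds = all(max_subtractions(m, n) <= max(m, n) + 1
                  for m in range(7) for n in range(7))
```

For example, `max_subtractions(3, 1)` is 4, attained by $u = 13$, $v = 3$.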


EXERCISES 

1. [M21] How can (8), (9), (10), (11), and (12) be derived easily from (6) and (7)?

2. [M22] Given that $u$ divides $v_1 v_2 \ldots v_n$, prove that $u$ divides

$$\gcd(u, v_1) \gcd(u, v_2) \ldots \gcd(u, v_n).$$

3. [M28] Show that the number of ordered pairs of positive integers $(u, v)$ such that
$\operatorname{lcm}(u, v) = n$ is the number of divisors of $n^2$.

4. [M21] Given positive integers $u$ and $v$, show that there are divisors $u'$ of $u$ and
$v'$ of $v$ such that $\gcd(u', v') = 1$ and $u'v' = \operatorname{lcm}(u, v)$.

► 5. [M26] Invent an algorithm (analogous to Algorithm B) for calculating the greatest
common divisor of two integers based on their balanced ternary representation.
Demonstrate your algorithm by applying it to the calculation of gcd(40902, 24140).

6. [M22] Given that $u$ and $v$ are random positive integers, find the mean and the
standard deviation of the quantity $A$ that enters into the timing of Program B. (This
is the number of right shifts applied to both $u$ and $v$ during the preparatory phase.)

7. [M20] Analyze the quantity $B$ that enters into the timing of Program B.

► 8. [M25] Show that in Program B, the average value of $E$ is approximately equal to
$\frac12 C_{\text{ave}}$, where $C_{\text{ave}}$ is the average value of $C$.

9. [18] Using Algorithm B and hand calculation, find gcd(31408, 2718). Also find
integers $m$ and $n$ such that $31408m + 2718n = \gcd(31408, 2718)$, using Algorithm X.

► 10. [HM24] Let $q_n$ be the number of ordered pairs of integers $(u, v)$ lying in the range
$1 \le u, v \le n$ such that $\gcd(u, v) = 1$. The object of this exercise is to prove that we
have $\lim_{n\to\infty} q_n/n^2 = 6/\pi^2$, thereby establishing Theorem D.

a) Use the principle of inclusion and exclusion (Section 1.3.3) to show that

$$q_n = n^2 - \sum_{p_1} \lfloor n/p_1 \rfloor^2 + \sum_{p_1 < p_2} \lfloor n/p_1 p_2 \rfloor^2 - \cdots,$$

where the sums are taken over all prime numbers $p_i$.

b) The Möbius function $\mu(n)$ is defined by the rules $\mu(1) = 1$, $\mu(p_1 p_2 \ldots p_r) = (-1)^r$
if $p_1, p_2, \ldots, p_r$ are distinct primes, and $\mu(n) = 0$ if $n$ is divisible by the square
of a prime. Show that $q_n = \sum_{k\ge1} \mu(k) \lfloor n/k \rfloor^2$.

c) As a consequence of (b), prove that $\lim_{n\to\infty} q_n/n^2 = \sum_{k\ge1} \mu(k)/k^2$.

d) Prove that $\bigl(\sum_{k\ge1} \mu(k)/k^2\bigr)\bigl(\sum_{m\ge1} 1/m^2\bigr) = 1$. Hint: When the series are
absolutely convergent we have

$$\Bigl(\sum_{k\ge1} a_k/k^z\Bigr)\Bigl(\sum_{m\ge1} b_m/m^z\Bigr) = \sum_{n\ge1} \Bigl(\sum_{d \backslash n} a_d b_{n/d}\Bigr)\Big/ n^z.$$

11. [M22] What is the probability that $\gcd(u, v) \le 3$? (See Theorem D.) What is
the average value of $\gcd(u, v)$?

12. [M24] (E. Cesàro.) If $u$ and $v$ are random positive integers, what is the average
number of (positive) divisors they have in common? [Hint: See the identity in exercise
10(d), with $a_k = b_m = 1$.]

13. [HM23] Given that $u$ and $v$ are random odd positive integers, show that they are
relatively prime with probability $8/\pi^2$.

14. [M26] What are the values of $v_1$ and $v_2$ when Algorithm X terminates?

► 15. [M22] Design an algorithm to divide $u$ by $v$ modulo $m$, given positive integers $u$,
$v$, and $m$, with $v$ relatively prime to $m$. In other words, your algorithm should find $w$,
in the range $0 \le w < m$, such that $u \equiv vw$ (modulo $m$).

16. [21] Use the text’s method to find a general solution in integers to the following
sets of equations:

    a) 3x + 7y + 11z = 1        b) 3x + 7y + 11z =  1
       5x + 7y -  5z = 3           5x + 7y -  5z = -3

► 17. [M24] Show how Algorithm L can be extended (as Algorithm A was extended to
Algorithm X) to obtain solutions of (15) when $u$ and $v$ are large.

18. [M37] Let $u$ and $v$ be odd integers, independently and uniformly distributed in
the ranges $2^m < u < 2^{m+1}$, $2^n < v < 2^{n+1}$. What is the exact probability that
a single “subtract and shift” cycle in Algorithm B, namely an operation that starts
at step B6 and then stops after step B5 is finished, reduces $u$ and $v$ to the ranges
$2^{m'} < u < 2^{m'+1}$, $2^{n'} < v < 2^{n'+1}$, as a function of $m$, $n$, $m'$, and $n'$? (This exercise
gives more accurate values for the transition probabilities than the text’s model does.)

19. [M24] Complete the text’s derivation of (38) by establishing (37).

20. [M26] Let $X = \lfloor \lg \gcd(u, v) \rfloor$. Show that the lattice-point model gives $X = 1$
with probability $\frac15$, $X = 2$ with probability $\frac{1}{10}$, $X = 3$ with probability $\frac{1}{20}$, etc., plus
correction terms that go rapidly to zero as $u$ and $v$ approach infinity; hence the average
value of $X$ is approximately $\frac45$. [Hint: Consider the relation between the probability of
a path from $(m, n)$ to $(k+1, k+1)$ and a corresponding path from $(m-k, n-k)$
to (1, 1).]

21. [HM26] Let $C_{mn}$ and $D_{mn}$ be the average number of subtraction and shift cycles,
respectively, in Algorithm B, when $u$ and $v$ are odd, $\lfloor \lg u \rfloor = m$, $\lfloor \lg v \rfloor = n$. Show that
for fixed $n$, $C_{mn} = \frac12 m + O(1)$ and $D_{mn} = m + O(1)$ as $m \to \infty$.

22. [23] Prove Eq. (44).

23. [M28] Show that if $C_{mn} = \alpha m + \beta n + \gamma$ for some constants $\alpha$, $\beta$, and $\gamma$, then

$$\sum_{1\le n<m\le N} (N-m)(N-n)\,2^{m+n-2} C_{mn} = 2^{2N}\bigl(\tfrac{11}{27}(\alpha+\beta)N + O(1)\bigr),$$

$$\sum_{1\le n\le N} (N-n)^2\, 2^{2n-2} C_{nn} = 2^{2N}\bigl(\tfrac{5}{27}(\alpha+\beta)N + O(1)\bigr).$$

► 24. [M30] If $v = 1$ but $u$ is large, during Algorithm B, it may take fairly long for the
algorithm to determine that $\gcd(u, v) = 1$. Perhaps it would be worthwhile to add a
test at the beginning of step B5: “If $t = 1$, the algorithm terminates with $2^k$ as the
answer.” Explore the question of whether or not this would be an improvement when
the algorithm deals with random inputs, by determining the average number of times
that step B6 is executed with $u = 1$ or $v = 1$, using the lattice-point model.

► 25. [M26] (R. P. Brent.) Let $u_n$ and $v_n$ be the values of $u$ and $v$ after $n$ iterations
of steps B3-B5; let $X_n = u_n/v_n$, and assume that $F_n(x)$ is the probability that $X_n \le x$,
for $0 \le x < \infty$. (a) Express $F_{n+1}(x)$ in terms of $F_n(x)$, under the assumption
that step B4 always branches to B3 with independent probability $\frac12$. (b) Let $G_n(x) =
F_n(x) + 1 - F_n(x^{-1})$ be the probability that $Y_n \le x$, for $0 \le x \le 1$, where $Y_n =
\min(u_n, v_n)/\max(u_n, v_n)$. Express $G_{n+1}$ in terms of $G_n$. (c) Express the distribution
$H_n(x)$ = probability that $\max(u_{n+1}, v_{n+1})/\max(u_n, v_n) < x$ in terms of $G_n$.

26. [M23] What is the length of the longest path from $(m, n)$ to (0,0) in the lattice-point
model?

► 27. [M28] Given $m \ge n \ge 1$, find values of $u$, $v$ with $\lfloor \lg u \rfloor = m$, $\lfloor \lg v \rfloor = n$, such
that Algorithm B requires $m + 1$ subtraction steps.

28. [M37] Prove that the subtraction step B6 of Algorithm B is never executed more
than $1 + \lfloor \lg \max(u, v) \rfloor$ times.

29. [M30] Evaluate the determinant

$$\begin{vmatrix}
\gcd(1,1) & \gcd(1,2) & \ldots & \gcd(1,n)\\
\gcd(2,1) & \gcd(2,2) & \ldots & \gcd(2,n)\\
\vdots & \vdots & & \vdots\\
\gcd(n,1) & \gcd(n,2) & \ldots & \gcd(n,n)
\end{vmatrix}$$

30. [M25] Show that Euclid’s algorithm (Algorithm A) applied to two $n$-bit binary
numbers requires $O(n^2)$ units of time, as $n \to \infty$. (The same upper bound obviously
holds for Algorithm B.)

31. [M22] Use Euclid’s algorithm to find a simple formula for $\gcd(2^m - 1, 2^n - 1)$
when $m$ and $n$ are nonnegative integers.

32. [M43] Can the upper bound $O(n^2)$ in exercise 30 be decreased, if another algorithm
for calculating the greatest common divisor is used?

33. [M46] Analyze V. C. Harris’s “binary Euclidean algorithm.”

► 34. [M32] (R. W. Gosper.) Demonstrate how to modify Algorithm B for large numbers,
using ideas analogous to those in Algorithm L.

► 35. [M28] (V. R. Pratt.) Extend Algorithm B to an Algorithm Y that is analogous
to Algorithm X.

36. [HM49] Find a rigorous proof that Brent’s model describes the asymptotic behavior
of Algorithm B.


*4.5.3. Analysis of Euclid’s Algorithm 

The execution time of Euclid’s algorithm depends on T, the number of 
times the division step A2 is performed. (See Algorithm 4.5.2A and Program 
4.5.2A.) The quantity T is also an important factor in the running time of other 
algorithms, such as the evaluation of functions satisfying a reciprocity formula 
(see Section 3.3.3). We shall see in this section that the mathematical analysis 
of this quantity T is interesting and instructive. 

Relation to continued fractions. Euclid’s algorithm is intimately connected with
continued fractions, which are expressions of the form

$$\cfrac{b_1}{a_1 + \cfrac{b_2}{a_2 + \cfrac{b_3}{\ddots\; a_{n-1} + \cfrac{b_n}{a_n}}}} = b_1/(a_1 + b_2/(a_2 + b_3/(\cdots/(a_{n-1} + b_n/a_n)\cdots))). \tag{1}$$

Continued fractions have a beautiful theory that is the subject of several books.
[See, for example, O. Perron, Die Lehre von den Kettenbrüchen, 3rd ed. (Stuttgart:
Teubner, 1954), 2 vols.; A. Khinchin, Continued Fractions, tr. by Peter
Wynn (Groningen: P. Noordhoff, 1963); H. S. Wall, Analytic Theory of Continued
Fractions (New York: Van Nostrand, 1948); and see also J. Tropfke,
Geschichte der Elementar-Mathematik 6 (Berlin: Gruyter, 1924), 74-84, for the
early history of the subject.] It is necessary to limit ourselves to a comparatively
brief treatment of the theory here, studying only those aspects that give us more
insight into the behavior of Euclid’s algorithm.

The continued fractions of primary interest to us are those in which all of
the $b$’s in (1) are equal to unity. For convenience in notation, let us define

$$/\!/x_1, x_2, \ldots, x_n/\!/ = 1/(x_1 + 1/(x_2 + 1/(\cdots(x_{n-1} + 1/x_n)\cdots))). \tag{2}$$

Thus, for example,

$$/\!/x_1, x_2/\!/ = \frac{1}{x_1 + 1/x_2} = \frac{x_2}{x_1 x_2 + 1}. \tag{3}$$

If $n = 0$, the symbol $/\!/x_1, \ldots, x_n/\!/$ is taken to mean 0. Let us also define the
polynomials $Q_n(x_1, x_2, \ldots, x_n)$ of $n$ variables, for $n \ge 0$, by the rule

$$Q_n(x_1, x_2, \ldots, x_n) = \begin{cases}
1, & \text{if } n = 0;\\
x_1, & \text{if } n = 1;\\
x_1 Q_{n-1}(x_2, \ldots, x_n) + Q_{n-2}(x_3, \ldots, x_n), & \text{if } n > 1.
\end{cases} \tag{4}$$

Thus $Q_2(x_1, x_2) = x_1 x_2 + 1$, $Q_3(x_1, x_2, x_3) = x_1 x_2 x_3 + x_1 + x_3$, etc. In general,
as noted by L. Euler in the eighteenth century, $Q_n(x_1, x_2, \ldots, x_n)$ is the sum of
all terms obtainable by starting with $x_1 x_2 \ldots x_n$ and deleting zero or more
nonoverlapping pairs of consecutive variables $x_j x_{j+1}$; there are $F_{n+1}$ such terms.
The polynomials defined in (4) are called “continuants.”
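
Rule (4) translates directly into a short computation. The Python sketch below is an illustration only, and the function name `continuant` is an assumption made for this example; it evaluates $Q_n(x_1, \ldots, x_n)$ for numeric arguments by applying rule (4) to successively longer suffixes of the argument list.

```python
def continuant(xs):
    """Evaluate Q_n(x_1, ..., x_n) by rule (4), working from the right end of xs."""
    q_suffix, q_shorter = 1, 0     # Q of the empty suffix, and a placeholder below it
    for x in reversed(xs):
        # Q(x, rest) = x * Q(rest) + Q(rest without its first element)
        q_suffix, q_shorter = x * q_suffix + q_shorter, q_suffix
    return q_suffix
```

With all arguments equal to 1 the value counts the terms of the polynomial, so `continuant([1] * 5)` is 8, the Fibonacci number $F_6$; and `continuant([3, 1, 1, 1, 2])` is 29 while `continuant([1, 1, 1, 2])` is 8.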

The basic property of the $Q$-polynomials is that

$$/\!/x_1, x_2, \ldots, x_n/\!/ = Q_{n-1}(x_2, \ldots, x_n)/Q_n(x_1, x_2, \ldots, x_n), \qquad n \ge 1. \tag{5}$$

This can be proved by induction, since it implies that

$$x_0 + /\!/x_1, \ldots, x_n/\!/ = Q_{n+1}(x_0, x_1, \ldots, x_n)/Q_n(x_1, \ldots, x_n);$$

hence $/\!/x_0, x_1, \ldots, x_n/\!/$ is the reciprocal of the latter quantity.

The $Q$-polynomials are symmetrical in the sense that

$$Q_n(x_1, x_2, \ldots, x_n) = Q_n(x_n, \ldots, x_2, x_1). \tag{6}$$

This follows from Euler’s observation above, and as a consequence we have

$$Q_n(x_1, \ldots, x_n) = x_n Q_{n-1}(x_1, \ldots, x_{n-1}) + Q_{n-2}(x_1, \ldots, x_{n-2}) \tag{7}$$

for $n > 1$. The $Q$-polynomials also satisfy the important identity

$$Q_n(x_1, \ldots, x_n)\,Q_n(x_2, \ldots, x_{n+1}) - Q_{n+1}(x_1, \ldots, x_{n+1})\,Q_{n-1}(x_2, \ldots, x_n) = (-1)^n, \qquad n \ge 1. \tag{8}$$

(See exercise 4.) The latter equation in connection with (5) implies that

$$/\!/x_1, \ldots, x_n/\!/ = \frac{1}{q_0 q_1} - \frac{1}{q_1 q_2} + \frac{1}{q_2 q_3} - \cdots + \frac{(-1)^{n-1}}{q_{n-1} q_n}, \qquad \text{where } q_k = Q_k(x_1, \ldots, x_k). \tag{9}$$

Thus the $Q$-polynomials are intimately related to continued fractions.

Every real number $X$ in the range $0 \le X < 1$ has a regular continued
fraction defined as follows: Let $X_0 = X$, and for all $n \ge 0$ such that $X_n \ne 0$
let

$$A_{n+1} = \lfloor 1/X_n \rfloor, \qquad X_{n+1} = 1/X_n - A_{n+1}. \tag{10}$$

If $X_n = 0$, the quantities $A_{n+1}$ and $X_{n+1}$ are not defined, and the regular
continued fraction for $X$ is $/\!/A_1, \ldots, A_n/\!/$. If $X_n \ne 0$, this definition guarantees
that $0 \le X_{n+1} < 1$, so each of the $A$’s is a positive integer. The definition (10)
clearly implies that

$$X = X_0 = \frac{1}{A_1 + X_1} = \frac{1}{A_1 + 1/(A_2 + X_2)} = \cdots;$$

hence

$$X = /\!/A_1, \ldots, A_{n-1}, A_n + X_n/\!/ \tag{11}$$

for all $n \ge 1$, whenever $X_n$ is defined. In particular, we have $X = /\!/A_1, \ldots, A_n/\!/$
when $X_n = 0$. If $X_n \ne 0$, the number $X$ always lies between $/\!/A_1, \ldots, A_n/\!/$ and
$/\!/A_1, \ldots, A_n + 1/\!/$, since by (7) the quantity $q_n = Q_n(A_1, \ldots, A_n + X_n)$ increases
monotonically from $Q_n(A_1, \ldots, A_n)$ up to $Q_n(A_1, \ldots, A_n + 1)$ as $X_n$ increases
from 0 to 1, and by (9) the continued fraction increases or decreases when $q_n$
increases, according as $n$ is even or odd. In fact,

$$\begin{aligned}
|X - /\!/A_1, \ldots, A_n/\!/| &= |/\!/A_1, \ldots, A_n + X_n/\!/ - /\!/A_1, \ldots, A_n/\!/|\\
&= |/\!/A_1, \ldots, A_n, 1/X_n/\!/ - /\!/A_1, \ldots, A_n/\!/|\\
&= \left|\frac{Q_n(A_2, \ldots, A_n, 1/X_n)}{Q_{n+1}(A_1, \ldots, A_n, 1/X_n)} - \frac{Q_{n-1}(A_2, \ldots, A_n)}{Q_n(A_1, \ldots, A_n)}\right|\\
&= 1/\bigl(Q_n(A_1, \ldots, A_n)\,Q_{n+1}(A_1, \ldots, A_n, 1/X_n)\bigr)\\
&\le 1/\bigl(Q_n(A_1, \ldots, A_n)\,Q_{n+1}(A_1, \ldots, A_n, A_{n+1})\bigr)
\end{aligned} \tag{12}$$

by (5), (7), (8), and (10). Therefore $/\!/A_1, \ldots, A_n/\!/$ is an extremely close approximation
to $X$, unless $n$ is small. If $X$ is irrational, it is impossible to have $X_n = 0$
for any $n$, so the regular continued fraction expansion in this case is an infinite
continued fraction $/\!/A_1, A_2, A_3, \ldots/\!/$. The value of an infinite continued fraction
is defined to be

$$\lim_{n\to\infty} /\!/A_1, A_2, \ldots, A_n/\!/,$$

and from the inequality (12) it is clear that this limit equals $X$.

The regular continued fraction expansion of real numbers has several properties
analogous to the representation of numbers in the decimal system. If we
use the formulas above to compute the regular continued fraction expansions of
some familiar real numbers, we find, for example, that

$$\begin{aligned}
\tfrac{8}{29} &= /\!/3, 1, 1, 1, 2/\!/;\\
\sqrt{8/29} &= /\!/1, 1, 9, 2, 2, 3, 2, 2, 9, 1, 2, 1, 9, 2, 2, 3, 2, 2, 9, 1, 2, 1, 9, 2, 2, 3, 2, 2, 9, 1, \ldots/\!/;\\
\sqrt[3]{2} &= 1 + /\!/3, 1, 5, 1, 1, 4, 1, 1, 8, 1, 14, 1, 10, 2, 1, 4, 12, 2, 3, 2, 1, 3, 4, 1, 1, 2, 14, 3, \ldots/\!/;\\
\pi &= 3 + /\!/7, 15, 1, 292, 1, 1, 1, 2, 1, 3, 1, 14, 2, 1, 1, 2, 2, 2, 2, 1, 84, 2, 1, 1, 15, 3, 13, \ldots/\!/;\\
e &= 2 + /\!/1, 2, 1, 1, 4, 1, 1, 6, 1, 1, 8, 1, 1, 10, 1, 1, 12, 1, 1, 14, 1, 1, 16, 1, 1, 18, 1, \ldots/\!/;\\
\gamma &= /\!/1, 1, 2, 1, 2, 1, 4, 3, 13, 5, 1, 1, 8, 1, 2, 4, 1, 1, 40, 1, 11, 3, 7, 1, 7, 1, 1, 5, \ldots/\!/;\\
\phi &= 1 + /\!/1, 1, 1, 1, 1, 1, 1, 1, 1, \ldots/\!/.
\end{aligned} \tag{13}$$

The numbers $A_1, A_2, \ldots$ are called the partial quotients of $X$. Note the regular
pattern that appears in the partial quotients for $\sqrt{8/29}$, $\phi$, and $e$; the reasons for
this behavior are discussed in exercises 12 and 16. There is no apparent pattern
in the partial quotients for $\sqrt[3]{2}$, $\pi$, or $\gamma$.

It is interesting to note that the ancient Greeks’ first definition of real
numbers, once they had discovered the existence of irrationals, was essentially
stated in terms of infinite continued fractions. (Later they adopted the suggestion
of Eudoxus that $x = y$ should be defined instead as “$x < r$ if and only if $y < r$,
for all rational $r$.”) See O. Becker, Quellen und Studien zur Geschichte Math.,
Astron., Physik (B) 2 (1933), 311-333.

When $X$ is a rational number, the regular continued fraction corresponds
in a natural way to Euclid’s algorithm. Let us assume that $X = v/u$, where
$u > v \ge 0$. The regular continued fraction process starts with $X_0 = X$; let us
define $U_0 = u$, $V_0 = v$. Assuming that $X_n = V_n/U_n \ne 0$, (10) becomes

$$\begin{aligned}
A_{n+1} &= \lfloor U_n/V_n \rfloor,\\
X_{n+1} &= U_n/V_n - A_{n+1} = (U_n \bmod V_n)/V_n.
\end{aligned} \tag{14}$$

Therefore, if we define

$$U_{n+1} = V_n, \qquad V_{n+1} = U_n \bmod V_n, \tag{15}$$

the condition $X_n = V_n/U_n$ holds throughout the process. Furthermore, (15) is
precisely the transformation made on the variables $u$ and $v$ in Euclid’s algorithm
(see Algorithm 4.5.2A, step A2). For example, since $\frac{8}{29} = /\!/3, 1, 1, 1, 2/\!/$, we know
that Euclid’s algorithm applied to $u = 29$ and $v = 8$ will require exactly five
division steps, and the quotients $\lfloor u/v \rfloor$ in step A2 will be successively 3, 1, 1, 1,
and 2. Note that the last partial quotient $A_n$ must be 2 or more when $X_n = 0$
and $n \ge 1$, since $X_{n-1}$ is less than unity.
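
The correspondence in (14) and (15) can be made concrete with a few lines of Python. This is an illustrative sketch only; the names `partial_quotients` and `evaluate` are assumptions made for this example. The first function runs Euclid's division steps on $(u, v)$ and collects the quotients, and the second rebuilds the value $/\!/A_1, \ldots, A_n/\!/$ from them in exact arithmetic.

```python
from fractions import Fraction

def partial_quotients(v, u):
    """Partial quotients A_1, ..., A_n of X = v/u (with u > v >= 0), via (14)-(15)."""
    U, V = u, v
    quotients = []
    while V != 0:
        quotients.append(U // V)   # A_{n+1} = floor(U_n / V_n)
        U, V = V, U % V            # (15): one division step of Euclid's algorithm
    return quotients

def evaluate(quotients):
    """Value of the continued fraction //A_1, ..., A_n// as an exact fraction."""
    value = Fraction(0)
    for a in reversed(quotients):
        value = 1 / (a + value)
    return value
```

For $X = 8/29$ the call `partial_quotients(8, 29)` yields the five quotients 3, 1, 1, 1, 2, and `evaluate` applied to that list gives back 8/29.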

From this correspondence with Euclid’s algorithm we can see that the regular
continued fraction for $X$ terminates at some step with $X_n = 0$ if and only if
$X$ is rational; for it is obvious that $X_n$ cannot be zero if $X$ is irrational, and,
conversely, we know that Euclid’s algorithm always terminates. If the partial
quotients obtained during Euclid’s algorithm are $A_1, A_2, \ldots, A_n$, then we have,
by (5),

$$\frac{v}{u} = \frac{Q_{n-1}(A_2, \ldots, A_n)}{Q_n(A_1, A_2, \ldots, A_n)}. \tag{16}$$

This formula holds also if Euclid’s algorithm is applied for $u < v$, when $A_1 = 0$.
Furthermore, because of (8), $Q_{n-1}(A_2, \ldots, A_n)$ and $Q_n(A_1, A_2, \ldots, A_n)$ are relatively
prime, and the fraction on the right-hand side of (16) is in lowest terms;
therefore

$$u = Q_n(A_1, A_2, \ldots, A_n)\,d, \qquad v = Q_{n-1}(A_2, \ldots, A_n)\,d, \tag{17}$$

where $d = \gcd(u, v)$.

The worst case. We can now apply these observations to determine the behavior
of Euclid’s algorithm in the “worst case,” or in other words to give an upper
bound on the number of division steps. The worst case occurs when the inputs
are consecutive Fibonacci numbers:

Theorem F (G. Lamé, 1845). For $n \ge 1$, let $u$ and $v$ be integers with $u > v > 0$
such that Euclid’s algorithm applied to $u$ and $v$ requires exactly $n$ division steps,
and such that $u$ is as small as possible satisfying these conditions. Then
$u = F_{n+2}$ and $v = F_{n+1}$.

Proof. By (17), we must have $u = Q_n(A_1, A_2, \ldots, A_n)\,d$, where $A_1, A_2, \ldots,
A_n$, and $d$ are positive integers and $A_n \ge 2$. Since $Q_n$ is a polynomial with
nonnegative coefficients, involving all of the variables, the minimum value is
achieved only when $A_1 = 1, \ldots, A_{n-1} = 1$, $A_n = 2$, $d = 1$. Putting these
values in (17) yields the desired result. |

(This theorem has the historical claim of being the first practical application
of the Fibonacci sequence; since then many other applications of Fibonacci
numbers to algorithms and to the study of algorithms have been discovered.)
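
Theorem F is easy to confirm experimentally for small $n$. The Python sketch below is an illustration only, with function names (`division_steps`, `smallest_inputs`) that are assumptions made for this example; it counts division steps of Euclid's algorithm and records, for each step count $n$, the first pair $(u, v)$ with $u > v > 0$ that needs exactly $n$ steps when $u$ is scanned in increasing order.

```python
def division_steps(u, v):
    """Number of division steps performed by Euclid's algorithm on (u, v)."""
    steps = 0
    while v != 0:
        u, v = v, u % v
        steps += 1
    return steps

def smallest_inputs(limit):
    """Map n -> first (u, v), u > v > 0, u <= limit, needing exactly n division steps."""
    best = {}
    for u in range(2, limit + 1):          # increasing u, so the first hit has minimal u
        for v in range(1, u):
            n = division_steps(u, v)
            if n not in best:
                best[n] = (u, v)
    return best
```

For instance, `smallest_inputs(400)[5]` is `(13, 8)`, that is, $(F_7, F_6)$.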

As a consequence of Theorem F we have an important corollary: 

Corollary L. If $0 \le u, v < N$, the number of division steps required when
Algorithm 4.5.2A is applied to $u$ and $v$ is at most $\lceil \log_\phi(\sqrt5\,N) \rceil - 2$.

Proof. By Theorem F, the maximum number of steps, $n$, occurs when $u = F_n$
and $v = F_{n+1}$, where $n$ is as large as possible with $F_{n+1} < N$. (The first
division step in this case merely interchanges $u$ and $v$ when $n > 1$.) Since
$F_{n+1} < N$, we have $\phi^{n+1}/\sqrt5 < N$ (see Eq. 1.2.8-15), so $n + 1 < \log_\phi(\sqrt5\,N)$.
This completes the proof. ∎
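Theorem F and Corollary L can both be tested by brute force for small inputs. The sketch below is an illustration under stated assumptions rather than part of the original text: `division_steps` counts iterations of the loop `u, v = v, u mod v`, which is taken here to match the division steps of Algorithm 4.5.2A, and all names are hypothetical.

```python
from math import ceil, log, sqrt

PHI = (1 + sqrt(5)) / 2  # the golden ratio

def division_steps(u, v):
    """Number of division steps Euclid's algorithm performs on (u, v)."""
    steps = 0
    while v != 0:
        u, v = v, u % v
        steps += 1
    return steps

def fib(n):
    """Fibonacci number F_n with F_0 = 0, F_1 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def smallest_pair(n, limit=200):
    """First (u, v) with u > v > 0, u minimal, needing exactly n steps."""
    for u in range(2, limit):
        for v in range(1, u):
            if division_steps(u, v) == n:
                return u, v
    return None

def lame_bound(N):
    """The bound of Corollary L: ceil(log_phi(sqrt(5) N)) - 2."""
    return ceil(log(sqrt(5) * N, PHI)) - 2

def max_steps(N):
    """Largest number of division steps over all 0 <= u, v < N."""
    return max(division_steps(u, v) for u in range(N) for v in range(N))
```

With $N = 100$ the worst pair is $u = 55$, $v = 89$, and the bound of Corollary L is attained.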

Note that $\log_\phi(\sqrt5\,N)$ is approximately $2.078 \ln N + 1.672 \approx 4.785 \log_{10} N + 1.672$.
See exercises 31 and 36 for extensions of Theorem F.

An approximate model. Now that we know the maximum number of division
steps that can occur, let us attempt to find the average number. Let $T(m, n)$ be
the number of division steps that occur when $u = m$ and $v = n$ are input to
Euclid’s algorithm. Thus

$$T(m, 0) = 0; \qquad T(m, n) = 1 + T(n, m \bmod n) \quad \text{if } n \ge 1. \tag{18}$$

Let $T_n$ be the average number of division steps when $v = n$ and when $u$ is chosen
at random; since only the value of $u \bmod v$ affects the algorithm after the first
division step, we may write

$$T_n = \frac{1}{n} \sum_{0 \le k < n} T(k, n). \tag{19}$$

For example, $T(0,5) = 1$, $T(1,5) = 2$, $T(2,5) = 3$, $T(3,5) = 4$, $T(4,5) = 3$, so

$$T_5 = \tfrac15 (1 + 2 + 3 + 4 + 3) = 2\tfrac35.$$
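Definitions (18) and (19) translate directly into code. The following sketch (function names are illustrative, not from the text) uses exact rational arithmetic so that averages such as $T_5$ come out as fractions.

```python
from fractions import Fraction

def T(m, n):
    """Number of division steps for inputs u = m, v = n, following Eq. (18)."""
    return 0 if n == 0 else 1 + T(n, m % n)

def T_avg(n):
    """T_n of Eq. (19): the average of T(k, n) over 0 <= k < n."""
    return Fraction(sum(T(k, n) for k in range(n)), n)
```

Here `T_avg(5)` is the fraction 13/5, in agreement with the worked example.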

In order to estimate $T_n$ for large $n$, let us first try an approximation suggested
by R. W. Floyd: We might assume that, for $0 \le k < n$, the value of $n$ is
essentially “random” modulo $k$, so that we can set

$$T_n \approx 1 + \frac1n (T_0 + T_1 + \cdots + T_{n-1}).$$

Then $T_n \approx S_n$, where the sequence $\langle S_n \rangle$ is the solution to the recurrence relation

$$S_0 = 0, \qquad S_n = 1 + \frac1n (S_0 + S_1 + \cdots + S_{n-1}), \quad n \ge 1. \tag{20}$$

(This approximation is analogous to the “lattice-point model” used to investigate
Algorithm B in Section 4.5.2.)

The recurrence (20) is readily solved by the use of generating functions. A
more direct way to solve it, analogous to our solution of the lattice-point model,
is by noting that

$$S_{n+1} = 1 + \frac{1}{n+1}(S_0 + S_1 + \cdots + S_{n-1} + S_n)
= 1 + \frac{1}{n+1}\bigl(n(S_n - 1) + S_n\bigr) = S_n + \frac{1}{n+1};$$

hence $S_n$ is $1 + \frac12 + \cdots + \frac1n = H_n$, a harmonic number. The approximation
$T_n \approx S_n$ now suggests that $T_n \approx \ln n + O(1)$.
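The claim that recurrence (20) produces the harmonic numbers can be confirmed exactly for small $n$. This is a small sketch with illustrative names, using rational arithmetic so that the comparison involves no rounding.

```python
from fractions import Fraction

def S_values(n_max):
    """S_0, ..., S_{n_max} computed directly from recurrence (20)."""
    S = [Fraction(0)]
    for n in range(1, n_max + 1):
        S.append(1 + sum(S) / n)
    return S

def H(n):
    """Harmonic number H_n = 1 + 1/2 + ... + 1/n (with H_0 = 0)."""
    return sum((Fraction(1, k) for k in range(1, n + 1)), Fraction(0))
```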

Comparison of this approximation with tables of the true value of $T_n$ shows,
however, that $\ln n$ is too large; $T_n$ does not grow this fast. One way to account for
the fact that this approximation is too pessimistic is to observe that the average
value of $n \bmod k$ is less than the average value of $\frac12 k$, in the range $1 \le k \le n$:

$$\frac1n \sum_{1 \le k \le n} (n \bmod k)
= \frac1n \sum_{1 \le q \le n}\;\sum_{\lfloor n/(q+1) \rfloor < k \le \lfloor n/q \rfloor} (n - qk)
= \Bigl(1 - \frac{\pi^2}{12}\Bigr) n + O(\log n) \tag{21}$$

(cf. exercise 4.5.2-10(c)). This is only about $.1775n$, not $.25n$; so the value
of $n \bmod k$ tends to be smaller than the above model predicts, and Euclid’s
algorithm works faster than we might expect.
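The constant in (21) can be observed numerically. The sketch below (names are illustrative) computes the average remainder directly; dividing by $n$ once more should give a value near $1 - \pi^2/12 \approx 0.1775$ for moderately large $n$, up to the $O(\log n)/n$ error term.

```python
from math import pi

def mean_remainder(n):
    """Average of n mod k over 1 <= k <= n, the left side of Eq. (21)."""
    return sum(n % k for k in range(1, n + 1)) / n

def limiting_ratio():
    """The coefficient 1 - pi^2/12 on the right side of Eq. (21)."""
    return 1 - pi ** 2 / 12
```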

A continuous model. The behavior of Euclid’s algorithm with $v = N$ is essentially
determined by the behavior of the regular continued fraction process when
$X = 0/N, 1/N, \ldots, (N-1)/N$. Assuming that $N$ is very large, we are led
naturally to a study of regular continued fractions when $X$ is a random real
number uniformly distributed in $[0, 1)$. Therefore let us define the distribution
function

$$F_n(x) = \text{probability that } X_n \le x, \quad \text{for } 0 \le x \le 1, \tag{22}$$

given a uniform distribution of $X = X_0$. By the definition of regular continued
fractions, we have $F_0(x) = x$, and

$$\begin{aligned}
F_{n+1}(x) &= \sum_{k \ge 1} \text{probability that } (k \le 1/X_n \le k + x) \\
&= \sum_{k \ge 1} \text{probability that } (1/(k+x) \le X_n \le 1/k) \\
&= \sum_{k \ge 1} \bigl(F_n(1/k) - F_n(1/(k+x))\bigr).
\end{aligned} \tag{23}$$

If the distributions $F_0(x), F_1(x), \ldots$ defined by these formulas approach a limiting
distribution $F_\infty(x) = F(x)$, we will have

$$F(x) = \sum_{k \ge 1} \bigl(F(1/k) - F(1/(k+x))\bigr). \tag{24}$$

One function that satisfies this relation is $F(x) = \log_b(1+x)$, for any base
$b > 1$; see exercise 19. The further condition $F(1) = 1$ implies that we should
take $b = 2$. Thus it is reasonable to make a guess that $F(x) = \lg(1+x)$, and
that $F_n(x)$ approaches this behavior.

We might conjecture, for example, that $F(\frac12) = \lg(\frac32) \approx 0.58496$; let us see
how close $F_n(\frac12)$ comes to this value for small $n$. We have $F_0(\frac12) = \frac12$, and

$$F_1(\tfrac12) = \frac11 - \frac{1}{1+\frac12} + \frac12 - \frac{1}{2+\frac12} + \cdots
= 2\Bigl(\frac12 - \frac13 + \frac14 - \frac15 + \cdots\Bigr) = 2(1 - \ln 2) \approx 0.6137;$$

$$\begin{aligned}
F_2(\tfrac12) &= \sum_{m \ge 1} \frac2m \Bigl(\frac{1}{2m+2} - \frac{1}{3m+2} + \frac{1}{4m+2} - \frac{1}{5m+2} + \cdots\Bigr) \\
&= \sum_{m \ge 1} \frac{2}{m^2} \Bigl(\frac12 - \frac13 + \frac14 - \frac15 + \cdots\Bigr)
 - \sum_{m \ge 1} \frac4m \Bigl(\frac{1}{2m(2m+2)} - \frac{1}{3m(3m+2)} + \cdots\Bigr) \\
&= \frac{\pi^2}{3}(1 - \ln 2) - \sum_{m \ge 1} \frac{4 S_m}{m^2},
\end{aligned}$$

where $S_m = 1/(4m+4) - 1/(9m+6) + 1/(16m+8) - \cdots$. Using the values
of $H_x$ for fractional $x$ found in Table 3 of Appendix B, we find that

$$S_1 = \tfrac{1}{12}, \qquad S_2 = \tfrac34 - \ln 2, \qquad S_3 = \tfrac{19}{20} - \pi/(2\sqrt3),$$

etc.; a numerical evaluation yields $F_2(\frac12) \approx 0.5748$. Although $F_1(x) = H_x$, it is
clear that $F_n(x)$ is difficult to calculate exactly when $n$ is large.
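The series in this passage lend themselves to a quick numerical experiment. The sketch below is only an illustration (the function names and truncation points are assumptions, not from the text): it sums the right side of (24) for $F(x) = \lg(1+x)$, evaluates $F_1(x) = \sum_{k \ge 1} (1/k - 1/(k+x))$, and evaluates $F_2(\frac12)$ from the final formula above, averaging two consecutive partial sums of each alternating series $S_m$ to speed up convergence.

```python
from math import log, log2, pi

def rhs_24(x, terms=100_000):
    """Partial sum of the right side of Eq. (24) with F(x) = lg(1 + x)."""
    return sum(log2(1 + 1 / k) - log2(1 + 1 / (k + x))
               for k in range(1, terms + 1))

def F1(x, terms=200_000):
    """F_1(x) = sum over k of (1/k - 1/(k + x)), truncated after `terms` terms."""
    return sum(x / (k * (k + x)) for k in range(1, terms + 1))

def S(m, terms=400):
    """S_m = sum over j >= 2 of (-1)^j / (j (j m + 2)), an alternating series."""
    total = previous = 0.0
    for j in range(2, terms + 2):
        previous = total
        total += (-1) ** j / (j * (j * m + 2))
    return (previous + total) / 2  # the mean of two partial sums converges faster

def F2_half(m_max=400):
    """F_2(1/2) = (pi^2/3)(1 - ln 2) - sum over m of 4 S_m / m^2, truncated."""
    return (pi ** 2 / 3) * (1 - log(2)) - 4 * sum(
        S(m) / m ** 2 for m in range(1, m_max + 1))
```

Calling `rhs_24(x)` for several $x$ in $[0, 1]$ and comparing with `log2(1 + x)` illustrates why $\lg(1+x)$ is a natural candidate for the limit, and printing `F1(0.5)` and `F2_half()` allows a comparison with the figures quoted above.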

The distributions $F_n(x)$ were first studied by K. F. Gauss, who thought of
the problem in 1800. His notebook for that year lists various recurrence relations
and gives a brief table of values, including the four-place value for $F_2(\frac12)$ that has
just been mentioned. After performing these calculations, Gauss wrote, “Tam
complicatae evadunt, ut nulla spes superesse videatur,” i.e., “They come out so
complicated that no hope appears to be left.” Twelve years later, Gauss wrote
a letter to Laplace in which he posed the problem as one he could not resolve to
his satisfaction. He said, “I found by very simple reasoning that, for $n$ infinite,
$F_n(x) = \log(1+x)/\log 2$. But the efforts that I made since then in my inquiries
to assign $F_n(x) - \log(1+x)/\log 2$ for very large but not infinite values of $n$
were fruitless.” He never published his “very simple reasoning,” and it is not
completely clear that he had found a rigorous proof. More than 100 years went
by before a proof was finally published, by R. O. Kuz’min [Atti del Congresso
internazionale dei matematici 6 (Bologna, 1928), 83-89], who showed that

$$F_n(x) = \lg(1+x) + O(e^{-A\sqrt{n}})$$

for some positive constant $A$. The error term was improved to $O(e^{-An})$ by Paul
Lévy shortly afterward [Bull. Soc. Math. de France 57 (1929), 178-194]*; but
Gauss’s problem, namely to find the asymptotic behavior of $F_n(x) - \lg(1+x)$,
was not really resolved until 1974, when Eduard Wirsing published a beautiful
analysis of the situation [Acta Arithmetica 24 (1974), 507-528]. We shall study
the simplest aspects of Wirsing’s approach here, since his method is an instructive
use of linear operators.

*An exposition of Lévy’s interesting proof appeared in the first edition of this book.

If $G$ is any function of $x$ defined for $0 \le x \le 1$, let $SG$ be the function
defined by

$$SG(x) = \sum_{k \ge 1} \Bigl(G\Bigl(\frac1k\Bigr) - G\Bigl(\frac{1}{k+x}\Bigr)\Bigr). \tag{25}$$

Thus, $S$ is an operator that changes one function into another. In particular,
by (23) we have $F_{n+1}(x) = SF_n(x)$, hence

$$F_n = S^n F_0. \tag{26}$$

(In this discussion $F_n$ stands for a distribution function, not for a Fibonacci
number.) Note that $S$ is a “linear operator”; i.e., $S(cG) = c(SG)$ for all constants $c$,
and $S(G_1 + G_2) = SG_1 + SG_2$.

Now if $G$ has a bounded first derivative, we can differentiate (25) term by
term to show that

$$(SG)'(x) = \sum_{k \ge 1} \frac{1}{(k+x)^2}\, G'\Bigl(\frac{1}{k+x}\Bigr); \tag{27}$$

hence $SG$ also has a bounded first derivative. (Term-by-term differentiation
of a convergent series is justified when the series of derivatives is uniformly
convergent; cf. K. Knopp, Theory and Application of Infinite Series (Glasgow:
Blackie, 1951), §47.)

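Let $H = SG$, and let $g(x) = (1+x)G'(x)$, $h(x) = (1+x)H'(x)$. It follows
that

$$h(x) = \sum_{k \ge 1} \frac{1+x}{(k+x)^2} \Bigl(1 + \frac{1}{k+x}\Bigr)^{-1} g\Bigl(\frac{1}{k+x}\Bigr)
= \sum_{k \ge 1} \Bigl(\frac{k}{k+1+x} - \frac{k-1}{k+x}\Bigr)\, g\Bigl(\frac{1}{k+x}\Bigr).$$

In other words, $h = Tg$, where $T$ is the linear operator defined by

$$Tg(x) = \sum_{k \ge 1} \Bigl(\frac{k}{k+1+x} - \frac{k-1}{k+x}\Bigr)\, g\Bigl(\frac{1}{k+x}\Bigr). \tag{28}$$

Continuing, we see that if $g$ has a bounded first derivative, we can differentiate
term by term to show that $Tg$ does also:

$$\begin{aligned}
(Tg)'(x) &= -\sum_{k \ge 1} \Bigl( \Bigl(\frac{k}{(k+1+x)^2} - \frac{k-1}{(k+x)^2}\Bigr)\, g\Bigl(\frac{1}{k+x}\Bigr)
 + \Bigl(\frac{k}{k+1+x} - \frac{k-1}{k+x}\Bigr) \frac{1}{(k+x)^2}\, g'\Bigl(\frac{1}{k+x}\Bigr) \Bigr) \\
&= -\sum_{k \ge 1} \Bigl( \frac{k}{(k+1+x)^2} \Bigl( g\Bigl(\frac{1}{k+x}\Bigr) - g\Bigl(\frac{1}{k+1+x}\Bigr) \Bigr)
 + \frac{1+x}{(k+x)^3 (k+1+x)}\, g'\Bigl(\frac{1}{k+x}\Bigr) \Bigr).
\end{aligned}$$

There is consequently a third linear operator, $U$, such that $(Tg)' = -U(g')$,
namely

$$U\varphi(x) = \sum_{k \ge 1} \Bigl( \frac{k}{(k+1+x)^2} \int_{1/(k+1+x)}^{1/(k+x)} \varphi(t)\,dt
 + \frac{1+x}{(k+x)^3 (k+1+x)}\, \varphi\Bigl(\frac{1}{k+x}\Bigr) \Bigr). \tag{29}$$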

What is the relevance of all this to our problem? Well, if we set

$$F_n(x) = \lg(1+x) + R_n(\lg(1+x)), \tag{30}$$

$$f_n(x) = (1+x)F_n'(x) = \frac{1}{\ln 2}\bigl(1 + R_n'(\lg(1+x))\bigr), \tag{31}$$

we have

$$f_n'(x) = R_n''(\lg(1+x)) / \bigl((\ln 2)^2 (1+x)\bigr); \tag{32}$$

the effect of the $\lg(1+x)$ term disappears, after these transformations. Furthermore
since $F_n = S^n F_0$ we have $f_n = T^n f_0$ and $f_n' = (-1)^n U^n f_0'$. Both $F_n$
and $f_n$ have bounded derivatives, by induction on $n$. Thus (32) becomes

$$(-1)^n R_n''(\lg(1+x)) = (1+x)(\ln 2)^2\, U^n f_0'(x). \tag{33}$$

Now $F_0(x) = x$, $f_0(x) = 1 + x$, and $f_0'(x)$ is the constant function 1. We are
going to show that the operator $U^n$ takes the constant function into a function
with very small values, hence $|R_n''(x)|$ must be very small for $0 \le x \le 1$. Finally
we can clinch the argument by showing that $R_n(x)$ itself is small: Since we have
$R_n(0) = R_n(1) = 0$, it follows from a well-known interpolation formula (cf.
exercise 4.6.4-15 with $x_0 = 0$, $x_1 = x$, $x_2 = 1$) that

$$R_n(x) = -\frac{x(1-x)}{2}\, R_n''(\xi(x)) \tag{34}$$

for some function $\xi(x)$, where $0 \le \xi(x) \le 1$ when $0 \le x \le 1$.

Thus everything hinges on our being able to prove that $U^n$ produces small
function values, where $U$ is the linear operator defined in (29). Note that $U$ is
a positive operator, in the sense that $U\varphi(x) \ge 0$ for all $x$ if $\varphi(x) \ge 0$ for all $x$.
It follows that $U$ is order-preserving: If $\varphi_1(x) \le \varphi_2(x)$ for all $x$ then we have
$U\varphi_1(x) \le U\varphi_2(x)$ for all $x$.

One way to exploit this property is to find a function $\varphi$ for which we can
calculate $U\varphi$ exactly and to use constant multiples of this function to bound the
ones that we are really interested in. First let us look for a function $g$ such that
$Tg$ is easy to compute. If we consider functions defined for all $x \ge 0$, instead of
only on $[0,1]$, it is easy to remove the summation from (25) by observing that

$$SG(x+1) - SG(x) = G\Bigl(\frac{1}{1+x}\Bigr) - \lim_{k \to \infty} G\Bigl(\frac{1}{k+x}\Bigr) = G\Bigl(\frac{1}{1+x}\Bigr) - G(0) \tag{35}$$

when $G$ is continuous. Since $T((1+x)G') = (1+x)(SG)'$, it follows (see
exercise 20) that

$$\frac{Tg(x)}{x+1} - \frac{Tg(x+1)}{x+2} = \Bigl(\frac{1}{x+1} - \frac{1}{x+2}\Bigr)\, g\Bigl(\frac{1}{1+x}\Bigr). \tag{36}$$

If we set $Tg(x) = 1/(x+1)$, we find that the corresponding value of $g(x)$ is
$1 + x - 1/(1+x)$. Let $\varphi(x) = g'(x) = 1 + 1/(1+x)^2$, so that $U\varphi(x) =
-(Tg)'(x) = 1/(1+x)^2$; this is the function $\varphi$ we have been looking for.

For this choice of $\varphi$ we have $2 \le \varphi(x)/U\varphi(x) = (1+x)^2 + 1 \le 5$ for
$0 \le x \le 1$, hence

$$\tfrac15 \varphi \le U\varphi \le \tfrac12 \varphi.$$

By the positivity of $U$ and $\varphi$ we can apply $U$ to this inequality again, obtaining
$\frac1{25}\varphi \le \frac15 U\varphi \le U^2\varphi \le \frac12 U\varphi \le \frac14 \varphi$; and after $n - 1$ applications we have

$$5^{-n} \varphi \le U^n \varphi \le 2^{-n} \varphi \tag{37}$$

for this particular $\varphi$. Let $\chi(x) = f_0'(x) = 1$ be the constant function; then for
$0 \le x \le 1$ we have $\frac54 \chi \le \varphi \le 2\chi$, hence

$$\tfrac58\, 5^{-n} \chi \le \tfrac12\, 5^{-n} \varphi \le \tfrac12\, U^n \varphi \le U^n \chi \le \tfrac45\, U^n \varphi \le \tfrac45\, 2^{-n} \varphi \le \tfrac85\, 2^{-n} \chi.$$
It follows by (33) that

$$\tfrac58 (\ln 2)^2\, 5^{-n} \le (-1)^n R_n''(x) \le \tfrac{16}{5} (\ln 2)^2\, 2^{-n}, \quad \text{for } 0 \le x \le 1;$$

hence by (30) and (34) we have proved the following result:

Theorem W. The distribution $F_n(x)$ equals $\lg(1+x) + O(2^{-n})$ as $n \to \infty$. In
fact, $F_n(x) - \lg(1+x)$ lies between $\frac{5}{16}(-1)^{n+1} 5^{-n} (\ln(1+x))(\ln(2/(1+x)))$ and
$\frac85(-1)^{n+1} 2^{-n} (\ln(1+x))(\ln(2/(1+x)))$, for $0 \le x \le 1$. ∎

With a slightly different choice of $\varphi$, we can obtain tighter bounds (see exercise 21).
In fact, Wirsing went much further in his paper, proving that

$$F_n(x) = \lg(1+x) + (-\lambda)^n \Psi(x) + O\bigl(x(1-x)(\lambda - 0.031)^n\bigr), \tag{38}$$

where

$$\lambda = 0.30366\,30028\,98732\,65860\ldots = //3, 3, 2, 2, 3, 13, 1, 174, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 1, \ldots// \tag{39}$$

is a fundamental constant (apparently unrelated to more familiar constants), and
where $\Psi$ is an interesting function that is analytic in the entire complex plane
except for the negative real axis from $-1$ to $-\infty$. Wirsing’s function satisfies
$\Psi(0) = \Psi(1) = 0$, $\Psi'(0) < 0$, and $S\Psi = -\lambda\Psi$; thus by (35) it satisfies the
identity

$$\Psi(z) - \Psi(z+1) = \frac{1}{\lambda}\, \Psi\Bigl(\frac{1}{1+z}\Bigr). \tag{40}$$

Furthermore, Wirsing demonstrated that

$$\Psi\Bigl(-\frac{u}{v} + \frac{i}{N}\Bigr) = c\lambda^{-n} \log N + O(1) \quad \text{as } N \to \infty, \tag{41}$$

where $c$ is a constant and $n = T(u, v)$ is the number of iterations when Euclid’s
algorithm is applied to the integers $u > v > 0$.

A complete solution to Gauss’s problem was found a few years later by K. I.
Babenko [Doklady Akad. Nauk SSSR 238 (1978), 1021-1024], who used powerful
techniques of functional analysis to prove that

$$F_n(x) = \lg(1+x) + \sum_{j \ge 2} \lambda_j^n \Psi_j(x) \tag{42}$$

for all $0 \le x \le 1$, $n \ge 1$. Here $|\lambda_2| > |\lambda_3| \ge |\lambda_4| \ge \cdots$, and each $\Psi_j(z)$ is
an analytic function in the complex plane except for a cut at $[-\infty, -1]$. The
function $\Psi_2$ is Wirsing’s $\Psi$, and $\lambda_2 = -\lambda$, while $\lambda_3 \approx 0.1009$, $\lambda_4 \approx -0.0408$,
$\lambda_5 \approx -0.0355$, $\lambda_6 \approx 0.0128$. Babenko also established further properties of the
eigenvalues $\lambda_j$, proving in particular that they are exponentially small as $j \to \infty$,
and that the sum for $j \ge k$ in (42) is bounded by $(\pi^2/6)|\lambda_k|^{n-1} \min(x, 1-x)$.
[Further information appears in a paper by Babenko and Iur’ev, Doklady Akad.
Nauk SSSR 240 (1978), 1273-1276.]
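The rapid convergence of $F_n(x)$ to $\lg(1+x)$ can be seen in a crude experiment. The sketch below is an assumption-laden illustration, not a method from the text: it replaces the uniform distribution by an evenly spaced grid of floating-point starting values, iterates $X \mapsto 1/X - \lfloor 1/X \rfloor$, and counts how often $X_n \le x$. Rounding errors grow with each iteration, so only small $n$ are meaningful.

```python
from math import log2

def empirical_F(n, x, samples=100_000):
    """Estimate F_n(x) from `samples` evenly spaced starting values in (0, 1)."""
    count = 0
    for i in range(samples):
        X = (i + 0.5) / samples
        for _ in range(n):
            if X == 0.0:          # the process has terminated
                break
            X = 1.0 / X
            X -= int(X)           # keep the fractional part
        if X <= x:
            count += 1
    return count / samples

def errors_at(x, n_max=3):
    """|F_n(x) - lg(1 + x)| for n = 0, ..., n_max, using the estimate above."""
    return [abs(empirical_F(n, x) - log2(1 + x)) for n in range(n_max + 1)]
```

The list returned by `errors_at(0.5)` shrinks quickly with $n$, as Theorem W predicts, until it reaches the noise level of the grid.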

From continuous to discrete. We have now derived results about the probability
distributions for continued fractions when $X$ is a real number uniformly
distributed in the interval $[0,1)$. But a real number is rational with probability
zero (almost all numbers are irrational), so these results do not apply directly
to Euclid’s algorithm. Before we can apply Theorem W to our problem, some
technicalities must be overcome. Consider the following observation based on
elementary measure theory:

Lemma M. Let $I_1, I_2, \ldots, J_1, J_2, \ldots$ be pairwise disjoint intervals contained in
the interval $[0,1)$, and let

$$I = \bigcup_{k \ge 1} I_k, \qquad J = \bigcup_{k \ge 1} J_k, \qquad K = [0,1] \setminus (I \cup J).$$

Assume that $K$ has measure zero. Let $P_n$ be the set $\{0/n, 1/n, \ldots, (n-1)/n\}$.
Then

$$\lim_{n \to \infty} \frac{\|I \cap P_n\|}{n} = \mu(I). \tag{43}$$

Here $\mu(I)$ is the Lebesgue measure of $I$, namely, $\sum_{k \ge 1} \mathrm{length}(I_k)$; and $\|I \cap P_n\|$
denotes the number of elements in the set $I \cap P_n$.

Proof. Let $I_N = \bigcup_{1 \le k \le N} I_k$ and $J_N = \bigcup_{1 \le k \le N} J_k$. Given $\epsilon > 0$, find $N$
large enough so that $\mu(I_N) + \mu(J_N) \ge 1 - \epsilon$, and let

$$K_N = K \cup \bigcup_{k > N} I_k \cup \bigcup_{k > N} J_k.$$

If $I$ is an interval, having any of the forms $(a, b)$ or $[a, b)$ or $(a, b]$ or $[a, b]$, it is
clear that $\mu(I) = b - a$ and

$$n\mu(I) - 1 \le \|I \cap P_n\| \le n\mu(I) + 1.$$

Now let $r_n = \|I_N \cap P_n\|$, $s_n = \|J_N \cap P_n\|$, $t_n = \|K_N \cap P_n\|$; we have

$$r_n + s_n + t_n = n;$$
$$n\mu(I_N) - N \le r_n \le n\mu(I_N) + N;$$
$$n\mu(J_N) - N \le s_n \le n\mu(J_N) + N.$$

Hence

$$\mu(I) - \frac{N}{n} - \epsilon \le \mu(I_N) - \frac{N}{n} \le \frac{r_n}{n} \le \frac{r_n + t_n}{n}
= 1 - \frac{s_n}{n} \le 1 - \mu(J_N) + \frac{N}{n} \le \mu(I) + \frac{N}{n} + \epsilon.$$

This holds for all $n$ and for all $\epsilon$; hence $\lim_{n \to \infty} r_n/n = \mu(I)$. ∎

Exercise 25 shows that Lemma M is not trivial, in the sense that some rather
restrictive hypotheses are needed to prove (43).

Distribution of partial quotients. Now we can put Theorem W and Lemma M
together to derive some solid facts about Euclid’s algorithm.

Theorem E. Let $n$ and $k$ be positive integers, and let $p_k(a, n)$ be the probability
that the $(k+1)$st quotient $A_{k+1}$ in Euclid’s algorithm is equal to $a$, when $v = n$
and $u$ is chosen at random. Then

$$\lim_{n \to \infty} p_k(a, n) = F_k\Bigl(\frac1a\Bigr) - F_k\Bigl(\frac{1}{a+1}\Bigr),$$

where $F_k(x)$ is the distribution function (22).

Proof. The set $I$ of all $X$ in $[0,1)$ for which $A_{k+1} = a$ is a union of disjoint
intervals, and so is the set $J$ of all $X$ for which $A_{k+1} \ne a$. Lemma M therefore
applies, with $K$ the set of all $X$ for which $A_{k+1}$ is undefined. Furthermore,
$F_k(1/a) - F_k(1/(a+1))$ is the probability that $1/(a+1) < X_k \le 1/a$, which
is $\mu(I)$, the probability that $A_{k+1} = a$. ∎

As a consequence of Theorems E and W, we can say that a quotient equal
to $a$ occurs with the approximate probability

$$\lg(1 + 1/a) - \lg(1 + 1/(a+1)) = \lg\bigl((a+1)^2/((a+1)^2 - 1)\bigr).$$

Thus

   a quotient of 1 occurs about $\lg(\frac43) = 41.504$ percent of the time;
   a quotient of 2 occurs about $\lg(\frac98) = 16.992$ percent of the time;
   a quotient of 3 occurs about $\lg(\frac{16}{15}) = 9.311$ percent of the time;
   a quotient of 4 occurs about $\lg(\frac{25}{24}) = 5.890$ percent of the time.
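The limiting probabilities just listed come from a one-line formula, sketched here with an illustrative function name. The partial sums telescope to $\lg(2(A+1)/(A+2))$, which tends to 1, so the values do form a probability distribution on the positive integers.

```python
from math import log2

def quotient_probability(a):
    """Approximate probability that a partial quotient equals a."""
    return log2((a + 1) ** 2 / ((a + 1) ** 2 - 1))

def cumulative_probability(A):
    """Sum of quotient_probability(a) for a = 1, ..., A."""
    return sum(quotient_probability(a) for a in range(1, A + 1))
```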


Actually, if Euclid’s algorithm produces the quotients $A_1, A_2, \ldots, A_t$, the
nature of the proofs above will guarantee this behavior only for $A_k$ when $k$ is
comparatively small with respect to $t$; the values $A_{t-1}, A_{t-2}, \ldots$ are not covered
by this proof. But we can in fact show that the distribution of the last quotients
$A_{t-1}, A_{t-2}, \ldots$ is essentially the same as the first.

For example, consider the regular continued fraction expansions for the set
of all proper fractions whose denominator is 29:

    1/29 = //29//         8/29 = //3,1,1,1,2//    15/29 = //1,1,14//         22/29 = //1,3,7//
    2/29 = //14,2//       9/29 = //3,4,2//        16/29 = //1,1,4,3//        23/29 = //1,3,1,5//
    3/29 = //9,1,2//     10/29 = //2,1,9//        17/29 = //1,1,2,2,2//      24/29 = //1,4,1,4//
    4/29 = //7,4//       11/29 = //2,1,1,1,3//    18/29 = //1,1,1,1,1,3//    25/29 = //1,6,4//
    5/29 = //5,1,4//     12/29 = //2,2,2,2//      19/29 = //1,1,1,9//        26/29 = //1,8,1,2//
    6/29 = //4,1,5//     13/29 = //2,4,3//        20/29 = //1,2,4,2//        27/29 = //1,13,2//
    7/29 = //4,7//       14/29 = //2,14//         21/29 = //1,2,1,1,1,2//    28/29 = //1,28//

Several things can be observed in this table.

a) As mentioned earlier, the last quotient is always 2 or more. Furthermore,
we have the obvious identity

$$//x_1, \ldots, x_{n-1}, x_n + 1// = //x_1, \ldots, x_{n-1}, x_n, 1//, \tag{44}$$

and this shows how partial fractions whose last quotient is unity are related to
regular continued fractions.

b) The values in the right-hand columns have a simple relationship to the values
in the left-hand columns; can the reader see the correspondence before reading
any further? The relevant identity is

$$1 - //x_1, x_2, \ldots, x_n// = //1, x_1 - 1, x_2, \ldots, x_n//; \tag{45}$$

see exercise 9.

c) There is symmetry between left and right in the first two columns: If
$//A_1, A_2, \ldots, A_t//$ occurs, so does $//A_t, \ldots, A_2, A_1//$. This will always be the case
(see exercise 26).

d) If we examine all of the quotients in the table, we find that there are 96 in
all, of which $\frac{39}{96} = 40.6$ percent are equal to 1, $\frac{21}{96} = 21.9$ percent are equal to 2,
$\frac{8}{96} = 8.3$ percent are equal to 3; this agrees reasonably well with the probabilities
listed above.
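The counts in observation (d), and the relationship (45) of observation (b), can be reproduced mechanically. The following sketch uses illustrative names and simply runs Euclid's algorithm on each pair $(29, m)$.

```python
from collections import Counter

def quotients(m, n):
    """Partial quotients A_1, ..., A_t of m/n = //A_1, ..., A_t//, for 0 < m < n."""
    result = []
    u, v = n, m
    while v != 0:
        result.append(u // v)
        u, v = v, u % v
    return result

def tally(n):
    """Count how often each quotient value occurs among all m/n, 0 < m < n."""
    counts = Counter()
    for m in range(1, n):
        counts.update(quotients(m, n))
    return counts
```

For example, `quotients(8, 29)` is `[3, 1, 1, 1, 2]`, while `quotients(21, 29)` is `[1, 2, 1, 1, 1, 2]`, as identity (45) requires.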

The number of division steps. Let us now return to our original problem and
investigate $T_n$, the average number of division steps when $v = n$. (See Eq. (19).)
Here are some sample values of $T_n$:

    n   =    95    96    97    98    99   100   101   102   103   104   105
    T_n =   5.0   4.4   5.3   4.8   4.7   4.6   5.3   4.6   5.3   4.7   4.6

    n   =   996   997   998   999  1000  1001  ...   9999  10000  10001
    T_n =   6.5   7.3   7.0   6.8   6.4   6.7  ...    8.6    8.3    9.1

    n   = 49999  50000  50001  ...  99999  100000  100001
    T_n =  10.6    9.7   10.0  ...   10.7    10.3    11.0

Note the somewhat erratic behavior; $T_n$ tends to be higher than its neighbors
when $n$ is prime, and it is correspondingly lower when $n$ has many divisors. (In
this list, 97, 101, 103, 997, and 49999 are primes; $10001 = 73 \cdot 137$, $50001 =
3 \cdot 7 \cdot 2381$, $99999 = 3 \cdot 3 \cdot 41 \cdot 271$, and $100001 = 11 \cdot 9091$.) It is not difficult
to understand why this happens: if $\gcd(u, v) = d$, Euclid’s algorithm applied
to $u$ and $v$ behaves essentially the same as if it were applied to $u/d$ and $v/d$.
Therefore, when $v = n$ has several divisors, there are many choices of $u$ for
which $n$ behaves as if it were smaller.

Accordingly let us consider another quantity, $\tau_n$, which is the average number
of division steps when $v = n$ and when $u$ is relatively prime to $n$. Thus

$$\tau_n = \frac{1}{\varphi(n)} \sum_{\substack{0 \le m < n \\ \gcd(m,n) = 1}} T(m, n). \tag{46}$$

It follows that

$$T_n = \frac1n \sum_{d \backslash n} \varphi(d)\, \tau_d. \tag{47}$$
Here is a table of $\tau_n$ for the same values of $n$ considered above:

    n     =    95    96    97    98    99   100   101   102   103   104   105
    tau_n =   5.4   5.3   5.3   5.6   5.2   5.2   5.4   5.3   5.4   5.3   5.6

    n     =   996   997   998   999  1000  1001  ...   9999  10000  10001
    tau_n =   7.2   7.3   7.3   7.3   7.3   7.4  ...   9.21   9.21   9.22

    n     = 49999  50000  50001  ...   99999  100000  100001
    tau_n = 10.58  10.57  10.59  ...  11.170  11.172  11.172

Clearly $\tau_n$ is much more well-behaved than $T_n$, and it should be more susceptible
to analysis. Inspection of a table of $\tau_n$ for small $n$ reveals some curious anomalies;
for example, $\tau_{50} = \tau_{100}$ and $\tau_{60} = \tau_{120}$. But as $n$ grows, the values of $\tau_n$ behave
quite regularly indeed, as the above table indicates, and they show no significant
relation to the factorization properties of $n$. If the reader will plot the values of
$\tau_n$ versus $\ln n$ on graph paper, for the values of $\tau_n$ given above, he will see that
the values lie very nearly on a straight line, and that the formula

$$\tau_n \approx 0.843 \ln n + 1.47 \tag{48}$$

is a very good approximation.
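Both the identity (47) and the empirical fit (48) can be checked with a few lines of code. The sketch below uses illustrative names and a deliberately naive totient; note that for $n = 1$ the only admissible $m$ is 0, so $\tau_1 = T(0, 1) = 1$.

```python
from fractions import Fraction
from math import gcd, log

def T(m, n):
    """Number of division steps of Euclid's algorithm for u = m, v = n."""
    steps = 0
    while n != 0:
        m, n = n, m % n
        steps += 1
    return steps

def T_avg(n):
    """T_n, Eq. (19)."""
    return Fraction(sum(T(k, n) for k in range(n)), n)

def tau(n):
    """tau_n, Eq. (46): average of T(m, n) over 0 <= m < n with gcd(m, n) = 1."""
    coprime = [m for m in range(n) if gcd(m, n) == 1]
    return Fraction(sum(T(m, n) for m in coprime), len(coprime))

def totient(n):
    """Euler's function, counted naively (gcd(0, 1) = 1 makes totient(1) = 1)."""
    return sum(1 for m in range(n) if gcd(m, n) == 1)

def T_from_tau(n):
    """Right-hand side of Eq. (47)."""
    return Fraction(sum(totient(d) * tau(d)
                        for d in range(1, n + 1) if n % d == 0), n)

def fit_48(n):
    """The empirical approximation (48)."""
    return 0.843 * log(n) + 1.47
```

Comparing `float(tau(n))` with `fit_48(n)` for the tabulated values of $n$ gives a feeling for how good the straight-line fit is.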

We can account for this behavior if we study the regular continued fraction
process a little further. Note that in Euclid’s algorithm as expressed in (15) we
have

$$\frac{V_0}{U_0} \cdot \frac{V_1}{U_1} \cdots \frac{V_{t-1}}{U_{t-1}} = \frac{V_{t-1}}{U_0},$$

since $U_{k+1} = V_k$; therefore if $U = U_0$ and $V = V_0$ are relatively prime, and if
there are $t$ division steps, we have

$$X_0 X_1 \cdots X_{t-1} = 1/U.$$

Setting $U = N$ and $V = m < N$, we find that

$$\ln X_0 + \ln X_1 + \cdots + \ln X_{t-1} = -\ln N. \tag{49}$$

We know the approximate distribution of $X_0, X_1, X_2, \ldots$, so we can use this
equation to estimate

$$t = T(N, m) = T(m, N) - 1.$$
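The product identity behind (49) holds exactly, which makes it a convenient sanity check. In this sketch (names are illustrative) the sequence $X_0 = m/N$, $X_{k+1} = 1/X_k - \lfloor 1/X_k \rfloor$ is generated with rational arithmetic until it reaches zero.

```python
from fractions import Fraction
from math import gcd, prod

def T(m, n):
    """Number of division steps of Euclid's algorithm for u = m, v = n."""
    steps = 0
    while n != 0:
        m, n = n, m % n
        steps += 1
    return steps

def x_sequence(m, N):
    """The nonzero values X_0, X_1, ..., X_{t-1} starting from X_0 = m/N."""
    xs = []
    X = Fraction(m, N)
    while X != 0:
        xs.append(X)
        inverse = 1 / X
        X = inverse - inverse.numerator // inverse.denominator
    return xs

def identity_holds(N):
    """Check X_0 ... X_{t-1} = 1/N and t = T(N, m) for all m coprime to N."""
    for m in range(1, N):
        if gcd(m, N) == 1:
            xs = x_sequence(m, N)
            if prod(xs) != Fraction(1, N) or len(xs) != T(N, m):
                return False
    return True
```

With $m = 8$ and $N = 29$ the sequence is $8/29$, $5/8$, $3/5$, $2/3$, $1/2$, whose product telescopes to $1/29$.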

Returning to the formulas preceding Theorem W, we find that the average
value of $\ln X_n$, when $X_0$ is a real number uniformly distributed in $[0,1)$, is

$$\int_0^1 \ln x\, F_n'(x)\,dx = \int_0^1 \ln x\, f_n(x)\,dx/(1+x), \tag{50}$$

where $f_n(x)$ is defined in (31). Now

$$f_n(x) = \frac{1}{\ln 2} + O(2^{-n}), \tag{51}$$

using the facts we have derived earlier (see exercise 23); hence the average value
of $\ln X_n$ is very well approximated by

$$\begin{aligned}
\frac{1}{\ln 2} \int_0^1 \frac{\ln x}{1+x}\,dx
&= -\frac{1}{\ln 2} \int_0^\infty \frac{u e^{-u}}{1 + e^{-u}}\,du \\
&= -\frac{1}{\ln 2} \sum_{k \ge 1} (-1)^{k+1} \int_0^\infty u e^{-ku}\,du \\
&= -\frac{1}{\ln 2} \Bigl(1 - \frac14 + \frac19 - \frac1{16} + \frac1{25} - \cdots\Bigr) \\
&= -\frac{1}{\ln 2} \Bigl(1 + \frac14 + \frac19 + \cdots - 2\Bigl(\frac14 + \frac1{16} + \frac1{36} + \cdots\Bigr)\Bigr) \\
&= -\frac{1}{2 \ln 2} \Bigl(1 + \frac14 + \frac19 + \cdots\Bigr) \\
&= -\pi^2/(12 \ln 2).
\end{aligned}$$
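The chain of equalities above can be spot-checked numerically in two independent ways. This sketch (with illustrative names and truncation parameters) sums the alternating series and also applies a plain midpoint rule to the original integral; the midpoint rule copes with the logarithmic singularity at 0 only approximately, so a loose tolerance is appropriate there.

```python
from math import log, pi

def mean_ln_X_series(terms=100_000):
    """-(1/ln 2)(1 - 1/4 + 1/9 - ...), truncated after `terms` terms."""
    s = sum((-1) ** (k + 1) / k ** 2 for k in range(1, terms + 1))
    return -s / log(2)

def mean_ln_X_quadrature(steps=200_000):
    """Midpoint-rule estimate of (1/ln 2) * integral of ln(x)/(1+x) on (0, 1)."""
    h = 1.0 / steps
    total = sum(log((i + 0.5) * h) / (1 + (i + 0.5) * h) for i in range(steps))
    return total * h / log(2)

def closed_form():
    """The value -pi^2 / (12 ln 2) obtained in the text."""
    return -pi ** 2 / (12 * log(2))
```

The reciprocal of the magnitude of `closed_form()` is the constant $(12 \ln 2)/\pi^2$.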


By (49) we therefore expect to have the approximate formula

$$-t\,\pi^2/(12 \ln 2) \approx -\ln N;$$

that is, $t$ should be approximately equal to $((12 \ln 2)/\pi^2) \ln N$. This constant
$(12 \ln 2)/\pi^2 = 0.842765913\ldots$ agrees perfectly with the empirical formula (48)
obtained earlier, so we have good reason to believe that the formula

$$\tau_n \approx \frac{12 \ln 2}{\pi^2} \ln n + 1.47 \tag{52}$$

indicates the true asymptotic behavior of $\tau_n$ as $n \to \infty$.

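If we assume that (52) is valid, we obtain the formula

$$T_n \approx \frac{12 \ln 2}{\pi^2} \Bigl(\ln n - \sum_{d \backslash n} \Lambda(d)/d\Bigr) + 1.47, \tag{53}$$

where $\Lambda(d)$ is von Mangoldt’s function defined by the rules

$$\Lambda(n) = \begin{cases} \ln p, & \text{if } n = p^r \text{ for } p \text{ prime and } r \ge 1; \\ 0, & \text{otherwise.} \end{cases} \tag{54}$$

For example,

$$\begin{aligned}
T_{100} &\approx \frac{12 \ln 2}{\pi^2} \Bigl(\ln 100 - \frac{\ln 2}{2} - \frac{\ln 2}{4} - \frac{\ln 5}{5} - \frac{\ln 5}{25}\Bigr) + 1.47 \\
&\approx (0.843)(4.605 - 0.347 - 0.173 - 0.322 - 0.064) + 1.47 \\
&\approx 4.59;
\end{aligned}$$

the exact value of $T_{100}$ is 4.56.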
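Formula (53) is straightforward to evaluate. The sketch below (names are illustrative; the additive constant 1.47 is the empirical value from (52)) computes von Mangoldt's function by trial division and reproduces the estimate for $n = 100$.

```python
from math import log, pi

def von_mangoldt(d):
    """Lambda(d): ln p if d is a power p^r of a prime with r >= 1, else 0."""
    if d < 2:
        return 0.0
    p = 2
    while p * p <= d:
        if d % p == 0:
            while d % p == 0:
                d //= p
            return log(p) if d == 1 else 0.0
        p += 1
    return log(d)  # no divisor up to sqrt(d), so d itself is prime

def T_estimate(n, constant=1.47):
    """The approximation (53) to T_n."""
    correction = sum(von_mangoldt(d) / d
                     for d in range(1, n + 1) if n % d == 0)
    return (12 * log(2) / pi ** 2) * (log(n) - correction) + constant
```

For $n = 100$ only the divisors 2, 4, 5, and 25 contribute to the correction, exactly as in the worked example.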

We can also estimate the average number of division steps when $u$ and $v$ are
both uniformly distributed between 1 and $N$, by calculating

$$\frac1N \sum_{1 \le n \le N} T_n. \tag{55}$$

Assuming formula (53), exercise 27 shows that this sum has the form

$$\frac{12 \ln 2}{\pi^2} \ln N + O(1), \tag{56}$$

and empirical calculations with the same numbers used to derive Eq. 4.5.2-45
show good agreement with the formula

$$\frac{12 \ln 2}{\pi^2} \ln N + 0.06. \tag{57}$$

Of course we have not yet proved anything about $T_n$ and $\tau_n$ in general; so far we
have only been considering plausible reasons why the above formulas ought to
hold. Fortunately it is now possible to supply rigorous proofs, based on a careful
analysis by several mathematicians.

The leading coefficient $(12 \ln 2)/\pi^2$ in the above formulas was established
first, in independent studies by John D. Dixon and Hans A. Heilbronn. Dixon
[J. Number Theory 2 (1970), 414-422] developed the theory of the $F_n(x)$ distributions
to show that individual partial quotients are essentially independent
of each other in an appropriate sense, and proved that for all positive $\epsilon$ we have
$|T(m, n) - ((12 \ln 2)/\pi^2) \ln n| < (\ln n)^{(1/2)+\epsilon}$ except for $\exp(-c(\epsilon)(\log N)^{\epsilon/2}) N^2$
values of $m$ and $n$ in the range $1 \le m < n \le N$, where $c(\epsilon) > 0$. Heilbronn’s
approach was completely different, working entirely with integers instead of continuous
variables. His idea, which is presented in slightly modified form in exercises
33 and 34, is based on the fact that $\tau_n$ can be related to the number of ways
to represent $n$ in a certain manner. Furthermore, his paper [Number Theory and
Analysis, ed. by Paul Turán (New York: Plenum, 1969), 87-96] shows that the
distribution of individual partial quotients 1, 2, ... that we have discussed above
actually applies to the entire collection of partial quotients belonging to the fractions
having a given denominator; this is a sharper form of Theorem E. A still
sharper result was obtained several years later by J. W. Porter [Mathematika 22
(1975), 20-28], who established that

$$\tau_n = \frac{12 \ln 2}{\pi^2} \ln n + C + O(n^{-1/6+\epsilon}), \tag{58}$$

where $C \approx 1.4670780794$ is the constant

$$C = \frac{6 \ln 2}{\pi^2} \bigl(3 \ln 2 + 4\gamma - 24\pi^{-2}\zeta'(2) - 2\bigr) - \frac12; \tag{59}$$

see D. E. Knuth, Computers and Math. with Applic. 2 (1976), 137-139. Thus
the conjecture (48) is fully proved.

The average running time for Euclid’s algorithm on multiple-precision integers,
using classical algorithms for arithmetic, was shown to be of order

$$\bigl(1 + \log(\max(u, v)/\gcd(u, v))\bigr) \log \min(u, v)$$

by G. E. Collins, in SIAM J. Computing 3 (1974), 1-10.

Summary. We have found that the worst case of Euclid’s algorithm occurs when
its inputs $u$ and $v$ are consecutive Fibonacci numbers (Theorem F); the number
of division steps when $0 \le u, v < N$ will never exceed $\lceil 4.8 \log_{10} N - 0.32 \rceil$. We have
determined the frequency of the values of various partial quotients, showing, for
example, that the division step finds $\lfloor u/v \rfloor = 1$ about 41 percent of the time
(Theorem E). And, finally, the theorems of Heilbronn and Porter prove that the
average number $T_n$ of division steps when $v = n$ is approximately

$$((12 \ln 2)/\pi^2) \ln n \approx 1.9405 \log_{10} n,$$

minus a correction term based on the divisors of $n$ as shown in Eq. (53).


EXERCISES 

► 1. [20] Since the quotient $\lfloor u/v \rfloor$ is equal to unity over 40 percent of the time in
Algorithm 4.5.2A, it may be advantageous on some computers to make a test for this
case and to avoid the division when the quotient is unity. Is the following MIX program
for Euclid’s algorithm more efficient than Program 4.5.2A?

          LDX   U        rX ← u.
          JMP   2F
    1H    STX   V        v ← rX.
          SUB   V        rA ← u − v.
          CMPA  V
          SRAX  5        rAX ← rA.
          JL    2F       Is u − v < v?
          DIV   V        rX ← rAX mod v.
    2H    LDA   V        rA ← v.
          JXNZ  1B       Done if rX = 0. ∎

2. [M21] Evaluate the matrix product

$$\begin{pmatrix} x_1 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} x_2 & 1 \\ 1 & 0 \end{pmatrix} \cdots \begin{pmatrix} x_n & 1 \\ 1 & 0 \end{pmatrix}.$$

3. [ M21 ] What is the value of 



4. [M20] Prove Eq. (8).

5. [HM25] Let $x_1, x_2, \ldots$ be a sequence of real numbers that are each greater than
some positive real number $\epsilon$. Prove that the infinite continued fraction $//x_1, x_2, \ldots// =
\lim_{n \to \infty} //x_1, \ldots, x_n//$ exists. Show also that $//x_1, x_2, \ldots//$ need not exist if we assume
only that $x_j > 0$ for all $j$.

6. [M23] Prove that the regular continued fraction expansion of a number is unique
in the following sense: If $B_1, B_2, \ldots$ are positive integers, then the infinite continued
fraction $//B_1, B_2, \ldots//$ is an irrational number $X$ between 0 and 1 whose regular continued
fraction has $A_n = B_n$ for all $n \ge 1$; and if $B_1, \ldots, B_m$ are positive integers
with $B_m > 1$, then the regular continued fraction for $X = //B_1, \ldots, B_m//$ has $A_n = B_n$
for $1 \le n \le m$.

7. [M26] Find all permutations $p(1)p(2) \ldots p(n)$ of the integers $\{1, 2, \ldots, n\}$ such that
$Q_n(x_1, x_2, \ldots, x_n) = Q_n(x_{p(1)}, x_{p(2)}, \ldots, x_{p(n)})$ holds for all $x_1, x_2, \ldots, x_n$.

8. [M20] Show that $-1/X_n = //A_n, \ldots, A_1, -X//$, whenever $X_n$ is defined, in the
regular continued fraction process.

9. [M21] Show that continued fractions satisfy the following identities:

a) $//x_1, \ldots, x_n// = //x_1, \ldots, x_k + //x_{k+1}, \ldots, x_n////$, $\quad 1 \le k \le n$;

b) $//0, x_1, x_2, \ldots, x_n// = x_1 + //x_2, \ldots, x_n//$, $\quad n \ge 1$;

c) $//x_1, \ldots, x_{k-1}, x_k, 0, x_{k+1}, x_{k+2}, \ldots, x_n// = //x_1, \ldots, x_{k-1}, x_k + x_{k+1}, x_{k+2}, \ldots, x_n//$, $\quad 1 \le k < n$;

d) $1 - //x_1, x_2, \ldots, x_n// = //1, x_1 - 1, x_2, \ldots, x_n//$, $\quad n \ge 1$.

10. [M28] By the result of exercise 6, every irrational real number $X$ has a unique
regular continued fraction representation of the form

$$X = A_0 + //A_1, A_2, A_3, \ldots//,$$

where $A_0$ is an integer and $A_1, A_2, A_3, \ldots$ are positive integers. Show that if $X$ has
this representation then the regular continued fraction for $1/X$ is

$$1/X = B_0 + //B_1, \ldots, B_m, A_5, A_6, \ldots//$$

for suitable integers $B_0, B_1, \ldots, B_m$. (The case $A_0 < 0$ is, of course, the most
interesting.) Explain how to determine the $B$’s in terms of $A_0$, $A_1$, $A_2$, $A_3$, and $A_4$.

11. [M30] (J. Lagrange.) Let $X = A_0 + //A_1, A_2, \ldots//$, $Y = B_0 + //B_1, B_2, \ldots//$
be the regular continued fraction representations of two real numbers $X$ and $Y$, in
the sense of exercise 10. Show that these representations “eventually agree,” in the
sense that $A_{m+k} = B_{n+k}$ for some $m$ and $n$ and for all $k \ge 0$, if and only if we
have $X = (qY + r)/(sY + t)$ for some integers $q, r, s, t$ with $|qt - rs| = 1$. (This
theorem is the analog, for continued fraction representations, of the simple result that
the representations of $X$ and $Y$ in the decimal system eventually agree if and only if
$X = (10^q Y + r)/10^s$ for some integers $q$, $r$, and $s$.)

► 12. [M30] A quadratic irrationality is a number of the form $(\sqrt D - U)/V$, where $D$,
$U$, and $V$ are integers, $D > 0$, $V \ne 0$, and $D$ is not a perfect square. We may assume
without loss of generality that $V$ is a divisor of $D - U^2$, for otherwise the number may
be rewritten as $(\sqrt{DV^2} - U|V|)/(V|V|)$.

a) Prove that the regular continued fraction expansion (in the sense of exercise 10) of
a quadratic irrationality $X = (\sqrt D - U)/V$ is obtained by the following formulas:

$$V_0 = V, \qquad A_0 = \lfloor X \rfloor, \qquad U_0 = U + A_0 V;$$

$$V_{n+1} = (D - U_n^2)/V_n, \qquad A_{n+1} = \lfloor (\sqrt D + U_n)/V_{n+1} \rfloor, \qquad U_{n+1} = A_{n+1} V_{n+1} - U_n.$$

[Note: An algorithm based on this process has many applications to the solution
of quadratic equations in integers; see, for example, H. Davenport, The Higher
Arithmetic (London: Hutchinson, 1952); W. J. LeVeque, Topics in Number Theory
(Reading, Mass.: Addison-Wesley, 1956); and see also Section 4.5.4. By exercise
1.2.4-35, we have $A_{n+1} = \lfloor (\lfloor \sqrt D \rfloor + U_n)/V_{n+1} \rfloor$ when $V_{n+1} > 0$, and $A_{n+1} =
\lfloor (\lfloor \sqrt D \rfloor + 1 + U_n)/V_{n+1} \rfloor$ when $V_{n+1} < 0$; hence such an algorithm need only
work with the positive integer $\lfloor \sqrt D \rfloor$.]

b) Prove that $0 < U_n < \sqrt D$, $0 < V_n < 2\sqrt D$, for all $n > N$, where $N$ is some integer
depending on $X$; hence the regular continued fraction representation of every
quadratic irrationality is eventually periodic. [Hint: Show that $(-\sqrt D - U)/V =
A_0 + //A_1, \ldots, A_n, -V_n/(\sqrt D + U_n)//$, and use Eq. (5) to prove that $(\sqrt D + U_n)/V_n$
is positive when $n$ is large.]

c) Letting $p_n = Q_{n+1}(A_0, A_1, \ldots, A_n)$ and $q_n = Q_n(A_1, \ldots, A_n)$, prove the identity
$V p_n^2 + 2U p_n q_n + ((U^2 - D)/V) q_n^2 = (-1)^{n+1} V_{n+1}$.

d) Prove that the regular continued fraction representation of an irrational number $X$
is eventually periodic if and only if $X$ is a quadratic irrationality. (This is the
continued fraction analog of the fact that the decimal expansion of a real number $X$
is eventually periodic if and only if $X$ is rational.)

13. [M40] (J. Lagrange, 1797.) Let $f(x) = a_n x^n + \cdots + a_0$, $a_n > 0$, be a polynomial
with integer coefficients, having no rational roots, and having exactly one real root
$\xi > 1$. Design a computer program to find the first thousand or so partial quotients
of $\xi$, using the following algorithm (which essentially involves only addition):

L1. Set $A \leftarrow 1$.

L2. For $k = 0, 1, \ldots, n - 1$ (in this order) and for $j = n - 1, \ldots, k$ (in this
order), set $a_j \leftarrow a_{j+1} + a_j$. (This step replaces $f(x)$ by $g(x) = f(x + 1)$, a
polynomial whose roots are one less than those of $f$.)

L3. If $a_n + a_{n-1} + \cdots + a_0 < 0$, set $A \leftarrow A + 1$ and return to L2.

L4. Output $A$ (which is the value of the next partial quotient). Replace the
coefficients $(a_n, a_{n-1}, \ldots, a_0)$ by $(-a_0, -a_1, \ldots, -a_n)$ and return to L1. (This
step replaces $f(x)$ by a polynomial whose roots are reciprocals of those of $f$.)

For example, starting with $f(x) = x^3 - 2$, the algorithm will output "1" (changing
$f(x)$ to $x^3 - 3x^2 - 3x - 1$); then "3" (changing $f(x)$ to $10x^3 - 6x^2 - 6x - 1$); etc.
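
Steps L1 through L4 translate almost line for line into code. The following Python sketch is only an illustration, not a worked solution to the exercise: the function name `lagrange_partial_quotients` and the convention of storing coefficients lowest degree first are assumptions made here, and the hypotheses of the exercise (no rational roots, exactly one real root greater than 1) are taken for granted rather than checked.

```python
def lagrange_partial_quotients(coeffs, count):
    """Return the first `count` partial quotients of the real root > 1.

    coeffs[j] is the coefficient a_j of x**j (lowest degree first), a_n > 0.
    Only integer additions and sign changes are used.
    """
    a = list(coeffs)
    n = len(a) - 1
    quotients = []
    while len(quotients) < count:
        A = 1                                   # L1
        while True:
            # L2: replace f(x) by f(x + 1).
            for k in range(n):
                for j in range(n - 1, k - 1, -1):
                    a[j] += a[j + 1]
            if sum(a) < 0:                      # L3: shifted root is still > 1
                A += 1
            else:
                break
        quotients.append(A)                     # L4: output A
        a = [-c for c in reversed(a)]           # roots become reciprocals
    return quotients


# f(x) = x^3 - 2, the example used in the exercise.
print(lagrange_partial_quotients([-2, 0, 0, 1], 2))
```

With the cubic from the example the first two outputs are 1 and 3, matching the text.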

14. [M22] (A. Hurwitz, 1891.) Show that the following rules make it possible to find 
the regular continued fraction expansion of 2X, given the partial quotients of X: 

    $2\,//2a, b, c, \ldots// = //a, 2b + 2\,//c, \ldots////$;

    $2\,//2a + 1, b, c, \ldots// = //a, 1, 1 + 2\,//b - 1, c, \ldots////$.

Use this idea to find the regular continued fraction expansion of $\frac{1}{2}e$, given the expansion
of $e$ in (13).

► 15. [M31] (R. W. Gosper.) Generalizing exercise 14, design an algorithm that computes
the continued fraction $X_0 + //X_1, X_2, \ldots//$ for $(ax + b)/(cx + d)$, given the continued
fraction $x_0 + //x_1, x_2, \ldots//$ for $x$, and given integers $a$, $b$, $c$, $d$ with $ad \ne bc$. Make
your algorithm an "on-line coroutine" that outputs as many $X_k$ as possible before
inputting each $x_j$. Demonstrate how your algorithm computes $(97x + 39)/(-62x - 25)$
when $x = -1 + //5, 1, 1, 1, 2, 1, 2//$.

16. [HM30] (L. Euler, 1731.) Let $f_0(z) = (e^z - e^{-z})/(e^z + e^{-z}) = \tanh z$, and let
$f_{n+1}(z) = 1/f_n(z) - (2n + 1)/z$. Prove that, for all $n$, $f_n(z)$ is an analytic function of
the complex variable $z$ in a neighborhood of the origin, and it satisfies the differential
equation $f_n'(z) = 1 - f_n(z)^2 - 2n f_n(z)/z$. Use this fact to prove that

    $\tanh z = //z^{-1}, 3z^{-1}, 5z^{-1}, 7z^{-1}, \ldots//$.

Then apply Hurwitz's rule (exercise 14) to prove that

    $e^{-1/n} = //\overline{1, (2m + 1)n - 1, 1}//$, $m \ge 0$.

(This notation denotes the infinite continued fraction $//1, n - 1, 1, 1, 3n - 1, 1, 1,
5n - 1, 1, \ldots//$.) Also find the regular continued fraction expansion of $e^{-2/n}$ when
$n > 0$ is odd.

► 17. [M28] (a) Prove that $//x_1, -x_2// = //x_1 - 1, 1, x_2 - 1//$. (b) Generalize this
identity, obtaining a formula for $//x_1, -x_2, x_3, -x_4, \ldots, x_{2n-1}, -x_{2n}//$ in which all
partial quotients are positive integers when the $x$'s are large positive integers. (c)
The result of exercise 16 implies that $\tan 1 = //1, -3, 5, -7, \ldots//$. Find the regular
continued fraction expansion of $\tan 1$.

18. [M40] Develop a computer program to find as many partial quotients of $x$ as
possible, when $x$ is a real number given with high precision. Use your program to
calculate the first one thousand or so partial quotients of Euler's constant $\gamma$, based on
D. W. Sweeney’s 3566-place value [Math. Comp. 17 (1963), 170-178]. (According to 
the theory in the text, we expect to get about 0.97 partial quotients per decimal digit. 
Cf. Algorithm 4.5.2L and the article by J. W. Wrench, Jr. and D. Shanks, Math. Comp. 
20 (1966), 444-447.) 

19. [M20] Prove that $F(x) = \log_b(1 + x)$ satisfies Eq. (24).

20 . [HM20] Derive (36) from (35). 
21. [HM29] (E. Wirsing.) The bounds (37) were obtained for a function $\varphi$ corresponding
to $g$ with $Tg(x) = 1/(x + 1)$. Show that the function corresponding to $Tg(x) =
1/(x + c)$ yields better bounds, when $c > 0$ is an appropriate constant.

22. [HM46] (K. I. Babenko.) Develop efficient means to calculate accurate approximations
to the quantities $\lambda_j$ and $\Psi_j(x)$ in (42), for small $j > 3$ and for $0 < x < 1$.

23. [HM23] Prove (51), using results from the proof of Theorem W. 

24. [M22] What is the average value of a partial quotient A n in the regular continued 
fraction expansion of a random real number? 

25. [HM25] Find an example of a set $I = I_1 \cup I_2 \cup I_3 \cup \cdots \subseteq [0, 1]$, where the $I$'s
are disjoint intervals, for which (43) does not hold.

26. [M23] Show that if the numbers $\{1/n, 2/n, \ldots, \lfloor n/2 \rfloor/n\}$ are expressed as regular
continued fractions, the result is symmetric between left and right, in the sense that
$//A_t, \ldots, A_2, A_1//$ appears whenever $//A_1, A_2, \ldots, A_t//$ does.

27. [M21] Derive (53) from (47) and (52). 

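28. [M23] Prove the following identities involving the three number-theoretic functions
$\varphi(n)$, $\mu(n)$, $\Lambda(n)$:

a) $\sum_{d \mid n} \mu(d) = \delta_{n1}$.

b) $\ln n = \sum_{d \mid n} \Lambda(d)$, $n = \sum_{d \mid n} \varphi(d)$.

c) $\Lambda(n) = \sum_{d \mid n} \mu(n/d) \ln d$, $\varphi(n) = \sum_{d \mid n} \mu(n/d)\, d$.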


29. [M23] Assuming that T n is given by (53), show that (55) equals (56). 

► 30. [HM32] The following variant of Euclid’s algorithm is often suggested: Instead 
of replacing $v$ by $u \bmod v$ during the division step, replace it by $|(u \bmod v) - v|$ if
$u \bmod v > \frac{1}{2}v$. Thus, for example, if $u = 26$ and $v = 7$, we have gcd(26, 7) =
gcd(—2,7) = gcd(7,2); —2 is the remainder of smallest magnitude when multiples of 7 
are subtracted from 26. Compare this procedure with Euclid’s algorithm; estimate the 
number of division steps this method saves, on the average. 
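
The variant is easy to compare with the ordinary algorithm experimentally. The Python sketch below is illustrative only; the function names `euclid_steps` and `least_remainder_steps` are assumptions introduced here, and each function simply returns the gcd together with the number of division steps performed.

```python
def euclid_steps(u, v):
    """Ordinary Euclid: return (gcd, number of division steps)."""
    steps = 0
    while v:
        u, v = v, u % v
        steps += 1
    return u, steps


def least_remainder_steps(u, v):
    """Variant of exercise 30: use the remainder of smallest magnitude."""
    steps = 0
    while v:
        r = u % v
        if 2 * r > v:          # u mod v > v/2
            r = v - r          # replace by |(u mod v) - v|
        u, v = v, r
        steps += 1
    return u, steps


print(euclid_steps(26, 7), least_remainder_steps(26, 7))
```

For the pair (26, 7) of the example, the ordinary algorithm passes through (7, 5), (5, 2), (2, 1), (1, 0), while the variant passes through (7, 2), (2, 1), (1, 0).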

► 31. [M35] Find the “worst case” of the modification of Euclid’s algorithm suggested 
in exercise 30; what are the smallest inputs u > v > 0 that require n division steps? 

32. [20] (a) A Morse code sequence of length $n$ is a string of $r$ dots and $s$ dashes,
where $r + 2s = n$. For example, the Morse code sequences of length 4 are

    · · · ·,   · · −,   · − ·,   − · ·,   − −.

Noting that the continuant $Q_4(x_1, x_2, x_3, x_4)$ is $x_1 x_2 x_3 x_4 + x_1 x_2 + x_1 x_4 + x_3 x_4 + 1$,
find and prove a simple relation between $Q_n(x_1, \ldots, x_n)$ and Morse code sequences of
length $n$. (b) (L. Euler, Novi Comm. Acad. Sci. Pet. 9 (1762), 53-69.) Prove that

    $Q_{m+n}(x_1, \ldots, x_{m+n}) = Q_m(x_1, \ldots, x_m)\, Q_n(x_{m+1}, \ldots, x_{m+n})$

        $+\; Q_{m-1}(x_1, \ldots, x_{m-1})\, Q_{n-1}(x_{m+2}, \ldots, x_{m+n})$.
33. [M32] Let $h(n)$ be the number of representations of $n$ in the form


    $n = xx' + yy'$, $x > y > 0$, $x' > y' > 0$, gcd$(x, y) = 1$, integer $x$, $x'$, $y$, $y'$.


(a) Show that if the conditions are relaxed to allow $x' = y'$, the number of
representations is $h(n) + \lfloor (n - 1)/2 \rfloor$. (b) Show that for fixed $y > 0$ and $0 < t \le y$,
where gcd$(t, y) = 1$, and for each fixed $x'$ in the range $0 < x' < n/(y + t)$ such that
$x't \equiv n$ (modulo $y$), there is exactly one representation of $n$ satisfying the restrictions
of (a) and the condition $x \equiv t$ (modulo $y$). (c) Consequently


    $h(n) = \sum \lceil (n/(y + t) - t')/y \rceil - \lfloor (n - 1)/2 \rfloor$,


where the sum is over all positive integers $y$, $t$, $t'$ such that gcd$(t, y) = 1$, $t \le y$, $t' \le y$,
$tt' \equiv n$ (modulo $y$). (d) Show that each of the $h(n)$ representations can be expressed
uniquely in the form

    $x = Q_m(x_1, \ldots, x_m)$, $y = Q_{m-1}(x_1, \ldots, x_{m-1})$,

    $x' = Q_k(x_{m+1}, \ldots, x_{m+k})\, d$, $y' = Q_{k-1}(x_{m+2}, \ldots, x_{m+k})\, d$,

where $m$, $k$, $d$, and the $x_j$ are positive integers with $x_1 \ge 2$, $x_{m+k} \ge 2$, and $d$ is a
divisor of $n$. The identity of exercise 32 now implies that $n/d = Q_{m+k}(x_1, \ldots, x_{m+k})$.
Conversely, any given sequence of positive integers $x_1, \ldots, x_{m+k}$ such that $x_1 \ge 2$,
$x_{m+k} \ge 2$, and $Q_{m+k}(x_1, \ldots, x_{m+k})$ divides $n$, corresponds in this way to $m + k - 1$
representations of $n$. (e) Therefore $nT_n = \lfloor (5n - 3)/2 \rfloor + 2h(n)$.

34. [HM40] (H. Heilbronn.) (a) Let $h_d(n)$ be the number of representations of $n$ as in
exercise 33 such that $xd < x'$, plus half the number of representations with $xd = x'$.
Let $g(n)$ be the number of representations without the requirement that gcd$(x, y) = 1$.
Prove that

    $h(n) = \sum_{d \mid n} \mu(d)\, g(n/d)$, $g(n) = 2 \sum_{d \mid n} h_d(n/d)$.

(b) Generalizing exercise 33(b), show that for $d \ge 1$, $h_d(n) = \sum (n/(y(y + t))) + O(n)$,
where the sum is over all integers $y$ and $t$ such that gcd$(t, y) = 1$ and $0 < t \le y <
\sqrt{n/d}$. (c) Show that $\sum (y/(y + t)) = \varphi(y) \ln 2 + O(\sigma_{-1}(y))$, where the sum is
over the range $0 < t \le y$, gcd$(t, y) = 1$; and where $\sigma_{-1}(y) = \sum_{d \mid y} (1/d)$. (d) Show
that $\sum_{1 \le y \le n} \varphi(y)/y^2 = \sum_{1 \le d \le n} \mu(d) H_{\lfloor n/d \rfloor}/d^2$. (e) Hence we have the asymptotic
formula $T_n = ((12 \ln 2)/\pi^2)(\ln n - \sum_{d \mid n} \Lambda(d)/d) + O(\sigma_{-1}(n)^2)$.

35. [HM41] (A. C. Yao and D. E. Knuth.) Prove that the sum of all partial quotients
for the fractions $m/n$, for $1 \le m < n$, is equal to $2(\sum \lfloor x/y \rfloor + \lfloor n/2 \rfloor)$, where the sum is
over all representations $n = xx' + yy'$ satisfying the conditions of exercise 33(a). Show
that $\sum \lfloor x/y \rfloor = 3\pi^{-2} n (\ln n)^2 + O(n \log n\, (\log \log n)^2)$, and apply this to the "ancient"
form of Euclid's algorithm that uses only subtraction instead of division.

36. [M35] (G. H. Bradley.) What is the smallest value of $u_n$ such that the calculation
of gcd$(u_1, \ldots, u_n)$ by steps C1 and C2 in Section 4.5.2 requires $N$ divisions, if Euclid's
algorithm is used throughout? Assume that $N \ge n$.
37. [M38] (T. S. Motzkin and E. G. Straus.) Let $a_1, \ldots, a_n$ be positive integers. Show
that $\max Q_n(a_{p(1)}, \ldots, a_{p(n)})$, over all permutations $p(1) \ldots p(n)$ of $\{1, 2, \ldots, n\}$, occurs
when $a_{p(1)} \ge a_{p(n)} \ge a_{p(2)} \ge a_{p(n-1)} \ge \cdots$; and the minimum occurs when $a_{p(1)} \le
a_{p(n)} \le a_{p(3)} \le a_{p(n-2)} \le a_{p(5)} \le \cdots \le a_{p(6)} \le a_{p(n-3)} \le a_{p(4)} \le a_{p(n-1)} \le a_{p(2)}$.

38. [M25] (J. Mikusiński.) Let $K(n) = \max_{m \ge 0} T(m, n)$. Theorem F shows that
$K(n) \le \lfloor \log_\phi(\sqrt{5}\,n + 1) \rfloor - 2$; prove that $K(n) \ge \frac{1}{2}\lfloor \log_\phi(\sqrt{5}\,n + 1) \rfloor - 2$.

► 39. [M25] (R. W. Gosper.) If a baseball player’s batting average is .334, what is
the fewest possible number of times he has been at bat? [Note for non-baseball-fans: 
Batting average = (number of hits)/(times at bat), rounded to three decimal places.] 

► 40. [M28] (The Stern-Peirce tree.) Consider an infinite binary tree in which each
node is labeled with the fraction $(p_l + p_r)/(q_l + q_r)$, where $p_l/q_l$ is the label of the
node's nearest left ancestor and $p_r/q_r$ is the label of the node's nearest right ancestor.
(A left ancestor is one that precedes a node in symmetric order, while a right ancestor
follows the node. See Section 2.3.1 for the definition of symmetric order.) If the node
has no left ancestors, $p_l/q_l = 0/1$; if it has no right ancestors, $p_r/q_r = 1/0$. Thus the
label of the root is 1/1; the labels of its two sons are 1/2 and 2/1; the labels of the four 
nodes on level 2 are 1/3, 2/3, 3/2, and 3/1, from left to right; the labels of the eight 
nodes on level 3 are 1/4, 2/5, 3/5, 3/4, 4/3, 5/3, 5/2, 4/1; and so on. 

Prove that p is relatively prime to q in each label p/q; furthermore, the node 
labeled p/q precedes the node labeled p'/q' in symmetric order if and only if the labels 
satisfy p/q < p'/q'. Find a connection between the continued fraction for the label of 
a node and the path to that node, thereby showing that each positive rational number 
appears as the label of exactly one node in the tree. 
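
The labeling rule can be checked mechanically against the levels listed above. The following Python sketch is an illustration under assumptions made here: fractions are held as (numerator, denominator) pairs so that 1/0 can be represented, and the name `stern_peirce_levels` is hypothetical. It relies on the observation that the nearest left and right ancestors of a left child are the parent's nearest left ancestor and the parent itself, and symmetrically for a right child.

```python
def stern_peirce_levels(depth):
    """Return the labels of levels 0..depth, each level left to right.

    Each pending node is described by the pair (nearest left ancestor,
    nearest right ancestor); its label is their mediant.
    """
    gaps = [((0, 1), (1, 0))]          # the root has no ancestors: 0/1 and 1/0
    levels = []
    for _ in range(depth + 1):
        labels, next_gaps = [], []
        for (pl, ql), (pr, qr) in gaps:
            node = (pl + pr, ql + qr)
            labels.append(node)
            next_gaps.append(((pl, ql), node))   # left son
            next_gaps.append((node, (pr, qr)))   # right son
        levels.append(labels)
        gaps = next_gaps
    return levels


for level in stern_peirce_levels(3):
    print(", ".join(f"{p}/{q}" for p, q in level))
```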

41. [M40] (J. Shallit, 1979.) Show that the regular continued fraction expansion of

    $\frac{1}{2^1} + \frac{1}{2^3} + \frac{1}{2^7} + \cdots = \sum_{n \ge 1} \frac{1}{2^{2^n - 1}}$

contains only 1's and 2's and has a fairly simple pattern. Prove that the partial
quotients of Liouville's numbers $\sum_{n \ge 1} l^{-n!}$ also have a regular pattern, when $l$ is
any integer $\ge 2$. [The latter numbers, introduced by J. Liouville in J. de Math.
Pures et Appl. 16 (1851), 133-142, were the first explicitly defined numbers to be 
proved transcendental. The former number and similar constants were first proved 
transcendental by A. J. Kempner, Trans. Amer. Math. Soc. 17 (1916), 476-482.] 

42. [M30] (J. Lagrange, 1798.) Let $X$ have the regular continued fraction expansion
$//A_1, A_2, \ldots//$, and let $q_n = Q_n(A_1, \ldots, A_n)$. Let $\|x\|$ denote the distance from $x$ to the
nearest integer, namely $\min_p |x - p|$. Show that $\|qX\| \ge \|q_{n-1}X\|$ for $1 \le q < q_n$.
(Thus the denominators $q_n$ of the so-called convergents $p_n/q_n = //A_1, \ldots, A_n//$ are the
"record-breaking" integers that make $\|qX\|$ achieve new lows.)

43. [M30] (D. W. Matula.) Show that the "mediant rounding" rule for fixed-slash
or floating-slash numbers, Eq. 4.5.1-1, can be implemented simply as follows, when
the number $x > 0$ is not representable: Let the regular continued fraction expansion
of $x$ be $a_0 + //a_1, a_2, \ldots//$, and let $p_n = Q_{n+1}(a_0, \ldots, a_n)$, $q_n = Q_n(a_1, \ldots, a_n)$. Then
round$(x) = (p_i/q_i)$, where $(p_i/q_i)$ is representable but $(p_{i+1}/q_{i+1})$ is not. [Hint: See
exercise 40.]
44. [M25] Suppose we are doing fixed slash arithmetic with mediant rounding, where
the fraction $(u/u')$ is representable if and only if $|u| < M$ and $0 \le u' < N$ and
gcd$(u, u') = 1$. Prove or disprove the identity $((u/u') \oplus (v/v')) \ominus (v/v') = (u/u')$ for
all representable $(u/u')$ and $(v/v')$, provided that $v' < \sqrt{N}$ and no overflow occurs.

45. [HM48] Develop the analysis of algorithms for computing the greatest common 
divisor of three or more integers. 


4.5.4. Factoring into Primes 

Several of the computational methods we have encountered in this book rest on 
the fact that every positive integer n can be expressed in a unique way in the 
form 

    $n = p_1 p_2 \ldots p_t$, $p_1 \le p_2 \le \cdots \le p_t$, (1)

where each $p_k$ is prime. (When $n = 1$, this equation holds for $t = 0$.) It is
unfortunately not a simple matter to find this prime factorization of n, or to 
determine whether or not n is prime. So far as anyone knows, it is a great deal 
harder to factor a large number n than to compute the greatest common divisor 
of two large numbers m and n; therefore we should avoid factoring large numbers 
whenever possible. But several ingenious ways to speed up the factoring process 
have been discovered, and we will now investigate some of them. 

Divide and factor. First let us consider the most obvious algorithm for factorization:
If $n > 1$, we can divide $n$ by successive primes $p = 2, 3, 5, \ldots$ until
discovering the smallest $p$ for which $n \bmod p = 0$. Then $p$ is the smallest prime
factor of $n$, and the same process may be applied to $n \leftarrow n/p$ in an attempt
to divide this new value of $n$ by $p$ and by higher primes. If at any stage we
find that $n \bmod p \ne 0$ but $\lfloor n/p \rfloor \le p$, we can conclude that $n$ is prime; for if
$n$ is not prime, then by (1) we must have $n \ge p_1^2$, but $p_1 > p$ implies that
$p_1^2 \ge (p + 1)^2 > p(p + 1) > p^2 + (n \bmod p) \ge \lfloor n/p \rfloor p + (n \bmod p) = n$. This
leads us to the following procedure:

Algorithm A (Factoring by division). Given a positive integer $N$, this algorithm
finds the prime factors $p_1 \le p_2 \le \cdots \le p_t$ of $N$ as in Eq. (1). The method
makes use of an auxiliary sequence of "trial divisors"

    $2 = d_0 < d_1 < d_2 < d_3 < \cdots$, (2)

which includes all prime numbers $\le \sqrt{N}$ (and which may also include values
that are not prime, if it is convenient to do so). The sequence of $d$'s must also
include at least one value such that $d_k \ge \sqrt{N}$.

A1. [Initialize.] Set $t \leftarrow 0$, $k \leftarrow 0$, $n \leftarrow N$. (During this algorithm the variables
$t$, $k$, $n$ are related by the following condition: "$n = N/p_1 \ldots p_t$, and $n$ has
no prime factors less than $d_k$.")

A2. [$n = 1$?] If $n = 1$, the algorithm terminates.

A3. [Divide.] Set $q \leftarrow \lfloor n/d_k \rfloor$, $r \leftarrow n \bmod d_k$. (Here $q$ and $r$ are the quotient and
remainder obtained when $n$ is divided by $d_k$.)

A4. [Zero remainder?] If $r \ne 0$, go to step A6.

A5. [Factor found.] Increase $t$ by 1, and set $p_t \leftarrow d_k$, $n \leftarrow q$. Return to step A2.

A6. [Low quotient?] If $q > d_k$, increase $k$ by 1 and return to step A3.

A7. [$n$ is prime.] Increase $t$ by 1, set $p_t \leftarrow n$, and terminate the algorithm. |
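
The steps above can be sketched in Python as follows. This is a minimal illustration rather than a tuned implementation: the names `prime_trial_divisors` and `factor_by_division` are assumptions introduced here, the trial divisors are taken to be exactly the primes (generated naively), and the function also returns the number of divisions performed in step A3 so that small cases can be counted.

```python
def prime_trial_divisors():
    """Yield d_0, d_1, ... = 2, 3, 5, 7, 11, ... (primes only, naive test)."""
    found = []
    c = 2
    while True:
        if all(c % p for p in found if p * p <= c):
            found.append(c)
            yield c
        c += 1


def factor_by_division(N):
    """Algorithm A: return (prime factors in nondecreasing order, divisions)."""
    factors, divisions = [], 0
    divisors = prime_trial_divisors()
    d, n = next(divisors), N                 # A1
    while n != 1:                            # A2
        q, r = divmod(n, d)                  # A3
        divisions += 1
        if r == 0:                           # A4, A5: factor found
            factors.append(d)
            n = q
        elif q > d:                          # A6: try the next trial divisor
            d = next(divisors)
        else:                                # A7: n is prime
            factors.append(n)
            break
    return factors, divisions


print(factor_by_division(25852))
```

Any increasing divisor sequence satisfying the conditions stated for (2) could be substituted for the prime generator; only the division count would change.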

As an example of Algorithm A, consider the factorization of the number 
$N = 25852$. We immediately find that $N = 2 \cdot 12926$; hence $p_1 = 2$. Furthermore,
$12926 = 2 \cdot 6463$, so $p_2 = 2$. But now $n = 6463$ is not divisible by 2, 3, 5,
..., 19; we find that $n = 23 \cdot 281$, hence $p_3 = 23$. Finally $281 = 12 \cdot 23 + 5$ and
$12 < 23$; hence $p_4 = 281$. The determination of 25852's factors has therefore
involved a total of 12 division operations; on the other hand, if we had tried to
factor the slightly smaller number 25849 (which is prime), at least 38 division
operations would have been performed. This illustrates the fact that Algorithm A
requires a running time roughly proportional to $\max(p_{t-1}, \sqrt{p_t})$. (If $t = 1$, this
formula is valid if we adopt the convention $p_0 = 1$.)

The sequence $d_0$, $d_1$, $d_2$, ... of trial divisors used in Algorithm A can be
taken to be simply 2, 3, 5, 7, 11, 13, 17, 19, 23, 25, 29, 31, 35, ..., where we 
alternately add 2 and 4 after the first three terms. This sequence contains all 
numbers that are not multiples of 2 or 3; it also includes numbers such as 25, 
35, 49, etc., which are not prime, but the algorithm will still give the correct 
answer. A further savings of 20 percent in computation time can be made by 
removing the numbers $30m \pm 5$ from the list for $m \ge 1$, thereby eliminating all
of the spurious multiples of 5. The exclusion of multiples of 7 shortens the list 
by 14 percent more, etc. A compact bit table can be used to govern the choice 
of trial divisors. 
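
The alternating "+2, +4" sequence is simple to generate. The sketch below is illustrative; the generator name `trial_divisors_mod6` is an assumption made here. It also checks the 20 percent figure by counting, within one block of 30 consecutive integers, how many candidates survive the removal of multiples of 5.

```python
from itertools import islice


def trial_divisors_mod6():
    """Yield 2, 3, then every number not divisible by 2 or 3."""
    yield 2
    yield 3
    d, step = 5, 2
    while True:
        yield d
        d += step
        step = 6 - step        # alternate between adding 2 and adding 4


first = list(islice(trial_divisors_mod6(), 13))
print(first)

# Within any block of 30 consecutive integers, count the candidates.
block = [d for d in range(30, 60) if d % 2 and d % 3]
kept = [d for d in block if d % 5]
print(len(block), len(kept))   # removing 30m +- 5 drops 2 of every 10
```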

If N is known to be small, it is reasonable to have a table of all the necessary 
primes as part of the program. For example, if N is less than a million, we need 
only include the 168 primes less than a thousand (followed by the value $d_{168} =
1000$, to terminate the list in case $N$ is a prime larger than $997^2$). Such a table
can be set up by means of a short auxiliary program, which builds the table just
after the factoring program has been loaded into the computer; see Algorithm 
1.3.2P, or see exercise 8. 

How many trial divisions are necessary in Algorithm A? Let $\pi(x)$ be the
number of primes $\le x$, so that $\pi(2) = 1$, $\pi(10) = 4$; the asymptotic behavior
of this function has been studied extensively by many of the world's greatest
mathematicians, beginning with Legendre in 1798. Numerous advances made
during the nineteenth century culminated in 1899, when Charles de la Vallée
Poussin proved that, for some $A > 0$,

    $\pi(x) = \int_2^x \frac{dt}{\ln t} + O\bigl(x e^{-A\sqrt{\log x}}\bigr)$. (3)

[Mem. Couronnes Acad. Roy. Belgique 59 (1899), 1-74.] Integrating by parts
yields

    $\pi(x) = \frac{x}{\ln x} + \frac{x}{(\ln x)^2} + \frac{2!\,x}{(\ln x)^3} + \cdots + \frac{r!\,x}{(\ln x)^{r+1}} + O\Bigl(\frac{x}{(\log x)^{r+2}}\Bigr)$ (4)

for all fixed $r \ge 0$. The error term in (3) has subsequently been improved;
for example, it can be replaced by $O(x \exp(-A(\log x)^{3/5}/(\log\log x)^{1/5}))$. [See
A. Walfisz, Weyl'sche Exponentialsummen in der neueren Zahlentheorie (Berlin,
1963), Chapter 5.] Bernhard Riemann conjectured in 1859 that

    $\pi(x) \approx \sum_{k \ge 1} \mu(k) L(\sqrt[k]{x})/k + O(1) = L(x) - \frac{1}{2}L(\sqrt{x}) - \frac{1}{3}L(\sqrt[3]{x}) + \cdots$ (5)

where $L(x) = \int_2^x dt/\ln t$, and his formula agrees well with actual counts when $x$
is of reasonable size. For example, we have the following table:

    x       π(x)        x/ln x        L(x)          Riemann's formula
    10^3    168         144.8         176.6         168.36
    10^6    78498       72382.4       78626.5       78527.40
    10^9    50847534    48254942.4    50849233.9    50847455.43
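
The first two columns of the table are easy to reproduce for the smaller arguments. The following Python sketch is illustrative only: `prime_count` is a name chosen here, it uses a plain sieve of Eratosthenes, and it makes no attempt at the $L(x)$ and Riemann columns, which require numerical integration.

```python
import math


def prime_count(x):
    """Return the number of primes <= x, using a sieve of Eratosthenes."""
    sieve = bytearray([1]) * (x + 1)
    sieve[0:2] = b"\x00\x00"                     # 0 and 1 are not prime
    for p in range(2, math.isqrt(x) + 1):
        if sieve[p]:
            # Strike out p*p, p*p + p, ... in a single slice assignment.
            sieve[p * p :: p] = bytearray(len(range(p * p, x + 1, p)))
    return sum(sieve)


for x in (10**3, 10**6):
    print(x, prime_count(x), round(x / math.log(x), 1))
```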


However, the distribution of large primes is not that simple, and Riemann's
conjecture (5) was disproved by J. E. Littlewood in 1914; see Hardy and Littlewood,
Acta Math. 41 (1918), 119-196, where it is shown that there is a positive
constant $C$ such that $\pi(x) > L(x) + C\sqrt{x}\,\log\log\log x/\log x$ for infinitely many $x$.
Littlewood's result shows that prime numbers are inherently somewhat
mysterious, and it will be necessary to develop deep properties of mathematics
before their distribution is really understood. Riemann made another much more
plausible conjecture, the famous "Riemann hypothesis," which states that the
complex function $\zeta(z)$ is zero only when the real part of $z$ is equal to $\frac{1}{2}$, except
in the trivial cases where $z$ is a negative even integer. This hypothesis, if true,
would imply that $\pi(x) = L(x) + O(\sqrt{x}\,\log x)$; see exercise 25. Richard Brent has
used a method of D. H. Lehmer to verify Riemann's hypothesis computationally
for all "small" values of $z$, by showing that $\zeta(z)$ has exactly 75,000,000 zeros
whose imaginary part is in the range $0 < \Im z < 32585736.4$; all of these zeros
have $\Re z = \frac{1}{2}$ and $\zeta'(z) \ne 0$. [Math. Comp. 33 (1979), 1361-1372.]

In order to analyze the average behavior of Algorithm A, we would like to
know how large the largest prime factor $p_t$ will tend to be. This question was first
investigated by Karl Dickman [Arkiv för Mat., Astron. och Fys. 22A, 10 (1930),
1-14], who studied the probability that a random integer between 1 and $x$ will
have its largest prime factor $\le x^\alpha$. Dickman gave a heuristic argument to show
that this probability approaches the limiting value $F(\alpha)$ as $x \to \infty$, where $F$ can
be calculated from the functional equation

    $F(\alpha) = \int_0^\alpha F\bigl(\frac{t}{1-t}\bigr) \frac{dt}{t}$, for $0 \le \alpha \le 1$; $F(\alpha) = 1$ for $\alpha \ge 1$. (6)

His argument was essentially this: Given $0 < t < 1$, the number of integers less
than $x$ whose largest prime factor is between $x^t$ and $x^{t+dt}$ is $xF'(t)\,dt$. The number
of primes $p$ in that range is $\pi(x^{t+dt}) - \pi(x^t) = \pi(x^t + (\ln x)x^t\,dt) - \pi(x^t) =
x^t\,dt/t$. For every such $p$, the number of integers $n$ such that "$np \le x$ and
the largest prime factor of $n$ is $\le p$" is the number of $n \le x^{1-t}$ whose largest
prime factor is $\le (x^{1-t})^{t/(1-t)}$, namely $x^{1-t}F(t/(1-t))$. Hence $xF'(t)\,dt =
(x^t\,dt/t)(x^{1-t}F(t/(1-t)))$, and (6) follows by integration. This heuristic argument
can be made rigorous; V. Ramaswami [Bull. Amer. Math. Soc. 55 (1949),
1122-1127] showed that the probability in question for fixed $\alpha$ is asymptotically
$F(\alpha) + O(1/\log x)$, as $x \to \infty$, and many other authors have extended the analysis
[see the survey by Karl K. Norton, Memoirs Amer. Math. Soc. 106 (1971), 9-27].

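If $\frac{1}{2} \le \alpha \le 1$, formula (6) simplifies to

    $F(\alpha) = 1 - \int_\alpha^1 F\bigl(\frac{t}{1-t}\bigr) \frac{dt}{t} = 1 - \int_\alpha^1 \frac{dt}{t} = 1 + \ln \alpha$,

since $t/(1-t) \ge 1$ throughout the range of integration.
Thus, for example, the probability that a random positive integer $\le x$ has a
prime factor $> \sqrt{x}$ is $1 - F(\frac{1}{2}) = \ln 2$, about 69 percent. In all such cases,
Algorithm A must work hard.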

The net result of this discussion is that Algorithm A will give the answer 
rather quickly if we want to factor a six-digit number; but for large N the amount 
of computer time for factorization by trial division will rapidly exceed practical 
limits, unless we are unusually lucky. 

Later in this section we will see that there are fairly good ways to determine 
whether or not a reasonably large number n is prime, without trying all divisors 
up to $\sqrt{n}$. Therefore Algorithm A would often run faster if we inserted a
primality test between steps A2 and A3; the running time for this improved
algorithm would then be roughly proportional to $p_{t-1}$, the second-largest prime
factor of $N$, instead of to $\max(p_{t-1}, \sqrt{p_t})$. By an argument analogous to
Dickman's (see exercise 18), we can show that the second-largest prime factor of
a random integer $x$ will be $\le x^\beta$ with approximate probability $G(\beta)$, where

    $G(\beta) = \int_0^\beta \Bigl(G\bigl(\frac{t}{1-t}\bigr) - F\bigl(\frac{t}{1-t}\bigr)\Bigr) \frac{dt}{t}$. (7)

Fig. 11. Probability distribution functions for the
two largest prime factors of a random integer $\le x$.

Clearly $G(\beta) = 1$ for $\beta \ge \frac{1}{2}$. (See Fig. 11.) Numerical evaluation of (6) and (7)
yields the following "percentage points":

    F(α), G(β) =  .01    .05    .10    .20    .35    .50    .65    .80    .90    .95    .99
             α = .2697  .3348  .3785  .4430  .5220  .6065  .7047  .8187  .9048  .9512  .9900
             β = .0056  .0273  .0531  .1003  .1611  .2117  .2582  .3104  .3590  .3967  .4517

Thus, the second-largest prime factor will be $\le x^{.2117}$ about half the time, etc.
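
The tabulated values of α for probabilities of .50 and above can be checked directly, because on that range formula (6) reduces to the closed form $1 + \ln \alpha$. The Python sketch below is illustrative only; the function name `dickman_F_upper` and the dictionary `table` are introduced here, and nothing is claimed about the range α < ½, where (6) has to be integrated numerically.

```python
import math


def dickman_F_upper(alpha):
    """F(alpha) for 1/2 <= alpha <= 1, where (6) reduces to 1 + ln(alpha)."""
    if not 0.5 <= alpha <= 1:
        raise ValueError("closed form is valid only for 1/2 <= alpha <= 1")
    return 1 + math.log(alpha)


# Probability -> alpha, copied from the percentage-point table above.
table = {0.50: 0.6065, 0.65: 0.7047, 0.80: 0.8187,
         0.90: 0.9048, 0.95: 0.9512, 0.99: 0.9900}

for prob, alpha in table.items():
    print(prob, alpha, round(dickman_F_upper(alpha), 4))

# Probability that the largest prime factor exceeds sqrt(x): ln 2.
print(1 - dickman_F_upper(0.5))
```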

The total number of prime factors, $t$, has also been intensively analyzed.
Obviously $1 \le t \le \lg N$, but these lower and upper bounds are seldom achieved.
It is possible to prove that if $N$ is chosen at random between 1 and $x$, the
probability that $t \le \ln\ln x + c\sqrt{\ln\ln x}$ approaches

    $\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{c} e^{-u^2/2}\,du$ (8)

as $x \to \infty$, for any fixed $c$. In other words, the distribution of $t$ is essentially
normal, with mean and variance $\ln\ln x$; about 99.73 percent of all the large
integers $\le x$ have $|t - \ln\ln x| \le 3\sqrt{\ln\ln x}$. Furthermore the average value of
$t - \ln\ln x$ for $1 \le N \le x$ is known to approach

    $\gamma + \sum_{p\ \mathrm{prime}} \bigl(\ln(1 - 1/p) + 1/(p - 1)\bigr) = 1.03465\ 38818\ 97438$.

[Cf. G. H. Hardy and E. M. Wright, An Introduction to the Theory of Numbers.]

The size of prime factors has a remarkable connection with permutations: 
The average number of bits in the $k$th largest prime factor of a random $n$-bit
integer is asymptotically the same as the average length of the $k$th largest cycle
of a random $n$-element permutation, as $n \to \infty$. [See D. E. Knuth and L. Trabb
Pardo, Theoretical Comp. Sci. 3 (1976), 321-348.] It follows that Algorithm A 
usually finds a few small factors and then begins a long-drawn-out search for the 
big ones that are left. 

Factoring à la Monte Carlo. Near the beginning of Chapter 3, we observed
that “a random number generator chosen at random isn’t very random.” This 
principle, which worked against us in that chapter, has the redeeming virtue 
that it leads to a surprisingly efficient method of factorization, discovered by 
J. M. Pollard [BIT 15 (1975), 331-334]. The number of computational steps 
in Pollard's method is on the order of $\sqrt{p_{t-1}}$, so it is significantly faster than
Algorithm A when $N$ is large. According to (7) and Fig. 11, the running time
will usually be well under $N^{1/4}$.

Let $f(x)$ be any polynomial with integer coefficients, and consider the two
sequences defined by

    $x_0 = y_0 = A$; $x_{m+1} = f(x_m) \bmod N$, $y_{m+1} = f(y_m) \bmod p$, (9)

where $p$ is any prime factor of $N$. It follows that

    $y_m = x_m \bmod p$, for $m \ge 1$. (10)

Now exercise 3.1-7 shows that we will have $y_m = y_{l(m)-1}$ for some $m \ge 1$,
where $l(m)$ is the greatest power of 2 that is $\le m$. Thus $x_m - x_{l(m)-1}$ will be
a multiple of $p$. Furthermore if $f(y) \bmod p$ behaves as a random mapping from
the set $\{0, 1, \ldots, p - 1\}$ into itself, exercise 3.1-12 shows that the average value
of the least such $m$ will be of order $\sqrt{p}$. In fact, exercise 4 below shows that this
average value for random mappings is less than $1.625\,Q(p)$, where the function
$Q(p) \approx \sqrt{\pi p/2}$ was defined in Section 1.2.11.3. If the different prime divisors
of $N$ correspond to different values of $m$ (as they almost surely will, when $N$
is large), we will be able to find them by calculating gcd$(x_m - x_{l(m)-1}, N)$ for
$m = 1, 2, 3, \ldots$, until the unfactored residue is prime.

From the theory in Chapter 3, we know that a linear polynomial $f(x) =
ax + c$ will not be sufficiently random for our purposes. The next-simplest case
is quadratic, say $f(x) = x^2 + 1$; although we don't know that this function
is sufficiently random, our lack of knowledge tends to support the hypothesis
of randomness, and empirical tests show that this $f$ does work essentially as
predicted. In fact, $f$ is probably slightly better than random, since $x^2 + 1$ takes
on only $\frac{1}{2}(p + 1)$ distinct values mod $p$. Therefore the following procedure is
reasonable:
Algorithm B (Monte Carlo factorization). This algorithm outputs the prime
factors of a given integer $N \ge 2$, with high probability, although there is a
chance that it will fail.

B1. [Initialize.] Set $x \leftarrow 5$, $x' \leftarrow 2$, $k \leftarrow 1$, $l \leftarrow 1$, $n \leftarrow N$. (During this algorithm,
$n$ is the unfactored part of $N$, and the variables $x$ and $x'$ represent the
quantities $x_m \bmod n$ and $x_{l(m)-1} \bmod n$ in (9), where $f(x) = x^2 + 1$, $A = 2$,
$l = l(m)$, and $k = 2l - m$.)

B2. [Test primality.] If $n$ is prime (see the discussion below), output $n$; the
algorithm terminates.

B3. [Factor found?] Set $g \leftarrow \gcd(x' - x, n)$. If $g = 1$, go on to step B4; otherwise
output $g$. Now if $g = n$, the algorithm terminates (and it has failed, because
we know that $n$ isn't prime). Otherwise set $n \leftarrow n/g$, $x \leftarrow x \bmod n$, $x' \leftarrow
x' \bmod n$, and return to step B2. (Note that $g$ may not be prime; this should
be tested. In the rare event that $g$ isn't prime, its prime factors probably
won't be determinable with this algorithm.)

B4. [Advance.] Set $k \leftarrow k - 1$. If $k = 0$, set $x' \leftarrow x$, $l \leftarrow 2l$, $k \leftarrow l$. Set
$x \leftarrow (x^2 + 1) \bmod n$ and return to B3. |
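
A direct transcription of steps B1 through B4 might look like the Python sketch below. It is a minimal illustration under stated assumptions: the primality test of step B2 is replaced by a naive trial-division helper `is_prime` (a stand-in chosen here, adequate only for small inputs), and the function name `monte_carlo_factor` and its return convention (list of outputs plus a success flag) are likewise introduced for this example.

```python
from math import gcd


def is_prime(n):
    """Naive primality test by trial division (a stand-in for step B2)."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def monte_carlo_factor(N):
    """Algorithm B: return (outputs, ok); outputs may include composites."""
    x, xp, k, l, n = 5, 2, 1, 1, N            # B1
    outputs = []
    while True:
        if is_prime(n):                       # B2
            outputs.append(n)
            return outputs, True
        while True:
            g = gcd(xp - x, n)                # B3
            if g != 1:
                break
            k -= 1                            # B4
            if k == 0:
                xp, l = x, 2 * l
                k = l
            x = (x * x + 1) % n
        outputs.append(g)
        if g == n:                            # failure: n is composite
            return outputs, False
        n //= g
        x %= n
        xp %= n


print(monte_carlo_factor(25852))
```

As the note in step B3 warns, an output such as 4 is not prime and would have to be factored separately.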

As an example of Algorithm B, let’s try to factor $N = 25852$ again. The
third execution of step B3 will output g = 4 (which isn’t prime). After six 
more iterations the algorithm finds the factor g = 23. Algorithm B has not 
distinguished itself in this example, but of course it was designed to factor big 
numbers. Algorithm A takes much longer to find large prime factors, but it can’t 
be beat when it comes to removing the small ones. In practice, we should run 
Algorithm A awhile before switching over to Algorithm B. 

We can get a better idea of Algorithm B’s prowess by considering the ten 
largest six-digit primes. The number of iterations, m(p), that Algorithm B needs 
to find the factor p is given in the following table: 

       p = 999863 999883 999907 999917 999931 999953 999959 999961 999979 999983
    m(p) =    276    409   2106   1561   1593   1091    474   1819    395    814

Experiments indicate that $m(p)$ has an average value of about $2\sqrt{p}$, and it
never exceeds $12\sqrt{p}$ when $p < 1000000$. The maximum $m(p)$ for $p < 10^6$ is
$m(874771) = 7685$; and the maximum of $m(p)/\sqrt{p}$ occurs when $p = 290047$,
$m(p) = 6251$. According to these experimental results, almost all 12-digit
numbers can be factored in fewer than 2000 iterations of Algorithm B (compared
to roughly 100,000 divisions in Algorithm A).

The time-consuming operations in each iteration of Algorithm B are the 
multiple-precision multiplication and division in step B4, and the gcd in step B3. 
If the gcd operation is slow, Pollard suggests gaining speed by accumulating the 
product mod $n$ of, say, ten consecutive $(x' - x)$ values before taking each gcd; this
replaces 90 percent of the gcd operations by a single multiplication and division 
while only slightly increasing the chance of failure. He also suggests starting 
with $m = q$ instead of $m = 1$ in step B1, where $q$ is, say, $\frac{1}{10}$ the number of
iterations you are planning to use. 
In those rare cases where failure occurs for large $N$, we could try using
$f(x) = x^2 + c$ for some $c \ne 0$ or 1. The value $c = -2$ should also be avoided,
since the recurrence $x_{m+1} = x_m^2 - 2$ has solutions of the form $x_m = r^{2^m} + r^{-2^m}$.

Other values of c do not seem to lead to simple relationships mod p, and they 
should all be satisfactory when used with suitable starting values. 

Richard Brent used a modification of Algorithm B in 1980 to discover the 
prime factor 1238926361552897 of $2^{256} + 1$.

Fermat’s method. Another approach to the factoring problem, which was used 
by Pierre de Fermat in 1643, is more suited to finding large factors than small 
ones. [Fermat’s original description of his method, translated into English, can 
be found in L. E. Dickson’s monumental History of the Theory of Numbers 1 
(New York: Chelsea, 1952), 357.] 

Assume that $N = uv$, where $u \le v$. For practical purposes we may assume
that $N$ is odd; this means that $u$ and $v$ are odd, and we can let

    $x = (u + v)/2$, $y = (v - u)/2$, (11)

    $N = x^2 - y^2$, $0 \le y < x \le N$. (12)

Fermat’s method consists of searching systematically for values of x and y that 
satisfy Eq. (12). The following algorithm shows how factoring can therefore be 
done without using any division: 

Algorithm C (Factoring by addition and subtraction). Given an odd number N, 
this algorithm determines the largest factor of N less than or equal to $\sqrt{N}$. 

C1. [Initialize.] Set $x' \leftarrow 2\lfloor\sqrt{N}\rfloor + 1$, $y' \leftarrow 1$, $r \leftarrow \lfloor\sqrt{N}\rfloor^2 - N$. (During this 
algorithm $x'$, $y'$, $r$ correspond respectively to $2x + 1$, $2y + 1$, $x^2 - y^2 - N$ 
as we search for a solution to (12); we will have $|r| < x'$ and $y' < x'$.) 

C2. [Test r.] If $r \le 0$, go to step C4. 

C3. [Step y.] Set $r \leftarrow r - y'$, $y' \leftarrow y' + 2$, and return to C2. 

C4. [Done?] If $r = 0$, the algorithm terminates; we have 

    $N = \bigl((x' - y')/2\bigr)\bigl((x' + y' - 2)/2\bigr),$ 

and $(x' - y')/2$ is the largest factor of N less than or equal to $\sqrt{N}$. 

C5. [Step x.] Set $r \leftarrow r + x'$, $x' \leftarrow x' + 2$, and return to C3. | 
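
Algorithm C translates almost line for line into a short program. The following Python sketch is only an illustration; the function name is our own, and the two nested loops are a structured restatement of the jumps among steps C2 through C5 (after step C5 the value of r is always positive, so the inner loop runs at least once, just as the jump to C3 requires).

```python
from math import isqrt

def factor_by_addition_and_subtraction(N):
    """Return (u, v) with N == u * v, where u is the largest factor of N <= sqrt(N).

    N must be a positive odd integer.  Inside the loops only additions,
    subtractions and comparisons are performed.
    """
    if N <= 0 or N % 2 == 0:
        raise ValueError("N must be a positive odd integer")
    s = isqrt(N)
    # C1: xp, yp, r stand for x' = 2x + 1, y' = 2y + 1, r = x^2 - y^2 - N.
    xp, yp, r = 2 * s + 1, 1, s * s - N
    while r != 0:              # C4: stop as soon as r = 0
        r += xp                # C5: step x
        xp += 2
        while r > 0:           # C2/C3: step y while r is still positive
            r -= yp
            yp += 2
    return (xp - yp) // 2, (xp + yp - 2) // 2

print(factor_by_addition_and_subtraction(377))
```

For N = 377 the loop stops with x' = 43 and y' = 17, giving the factors 13 and 29.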

The reader may find it amusing to find the factors of 377 by hand, using this 
algorithm. The number of steps needed to find the factors u and v of $N = uv$ 
is essentially proportional to $(x' + y' - 2)/2 - \lfloor\sqrt{N}\rfloor = v - \lfloor\sqrt{N}\rfloor$; this can, 
of course, be a very large number, although each step can be done very rapidly 
on most computers. An improvement that requires only $O(N^{1/3})$ operations in 
the worst case has been developed by R. S. Lehman [Math. Comp. 28 (1974), 
637-646]. 

It is not quite correct to call Algorithm C “Fermat’s method,” since Fermat 
used a somewhat more streamlined approach. Algorithm C’s main loop is quite 
fast on computers, but it is not very suitable for hand calculation. Fermat 
actually did not keep the running value of y; he would look at $x^2 - N$ and 
tell whether or not this quantity was a perfect square by looking at its least 
significant digits. (The last two digits of a perfect square must be 00, e1, e4, 25, 
o6, or e9, where e is an even digit and o is an odd digit.) Therefore he avoided the 
operations of steps C2 and C3, replacing them by an occasional determination 
that a certain number is not a perfect square. 

Fermat’s method of looking at the rightmost digits can, of course, be generalized 
by using other moduli. Suppose for clarity that N = 11111, and consider 
the following table: 

     m   if x mod m is             then $x^2$ mod m is          and $(x^2 - N)$ mod m is 
     3   0,1,2                     0,1,1                      1,2,2 
     5   0,1,2,3,4                 0,1,4,4,1                  4,0,3,3,0 
     7   0,1,2,3,4,5,6             0,1,4,2,2,4,1              5,6,2,0,0,2,6 
     8   0,1,2,3,4,5,6,7           0,1,4,1,0,1,4,1            1,2,5,2,1,2,5,2 
    11   0,1,2,3,4,5,6,7,8,9,10    0,1,4,9,5,3,3,5,9,4,1      10,0,3,8,4,2,2,4,8,3,0 


If $x^2 - N$ is to be a perfect square $y^2$, it must have a residue mod m consistent 
with this fact, for all m. For example, if N = 11111 and $x \bmod 3 \ne 0$, then 
$(x^2 - N) \bmod 3 = 2$, so $x^2 - N$ cannot be a perfect square; therefore x must be 
a multiple of 3 whenever $11111 = x^2 - y^2$. The table tells us, in fact, that 

    x mod 3 = 0; 
    x mod 5 = 0, 1, or 4; 
    x mod 7 = 2, 3, 4, or 5;                    (13) 
    x mod 8 = 0 or 4 (hence x mod 4 = 0); 
    x mod 11 = 1, 2, 4, 7, 9, or 10. 


This narrows down the search for x considerably. For example, x must be a 
multiple of 12. We must have $x \ge \lceil\sqrt{N}\rceil = 106$, and it is easy to verify that the 
first value of $x \ge 106$ that satisfies all of the conditions in (13) is x = 144. Now 
$144^2 - 11111 = 9625$, and by attempting to take the square root of 9625 we find 
that it is not a square. The first value of x > 144 that satisfies (13) is x = 156. 
In this case $156^2 - 11111 = 13225 = 115^2$; so we have found the desired solution 
x = 156, y = 115. This calculation shows that $11111 = 41 \cdot 271$. 
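
The residue conditions (13) and the short search that follows them can be reproduced mechanically. The sketch below is one possible way to do so, not part of the original method description; the function names `allowed_residues` and `fermat_with_moduli` are our own. It lists, for each modulus m, the residues x mod m for which $x^2 - N$ is a square modulo m, and then takes a square root only for those x that pass every modulus.

```python
from math import isqrt

def allowed_residues(N, m):
    """Residues x mod m for which x*x - N is congruent to a square modulo m."""
    squares = {(y * y) % m for y in range(m)}
    return [x for x in range(m) if (x * x - N) % m in squares]

def fermat_with_moduli(N, moduli):
    """Search x = ceil(sqrt(N)), ceil(sqrt(N)) + 1, ... and return (x, y, tested).

    `tested` records the x values that survived all moduli and therefore
    needed an actual square-root test.
    """
    allowed = {m: set(allowed_residues(N, m)) for m in moduli}
    x = isqrt(N - 1) + 1                 # ceil(sqrt(N)) for N > 1
    tested = []
    while True:
        if all(x % m in allowed[m] for m in moduli):
            tested.append(x)
            y = isqrt(x * x - N)
            if y * y == x * x - N:
                return x, y, tested
        x += 1

for m in (3, 5, 7, 8, 11):
    print(m, allowed_residues(11111, m))
print(fermat_with_moduli(11111, [3, 5, 7, 8, 11]))
```

Only x = 144 and x = 156 reach the square-root test, in agreement with the hand calculation.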

The hand calculations involved in the above example are comparable to the 
amount of work required to divide 11111 by 13, 17, 19, 23, 29, 31, 37, and 41, 
even though the factors 41 and 271 are not very close to each other; thus we can 
see the advantages of Fermat’s method. 

In place of the moduli considered in (13), we can use any powers of distinct 
primes. For example, if we had used 25 in place of 5, we would find that the 
only permissible values of x mod 25 are 0, 5, 6, 10, 15, 19, and 20. This gives 
more information than (13). In general, we will get more information modulo $p^2$ 
than we do modulo p, for odd primes p, whenever $x^2 - N \equiv 0$ (modulo p) has 
a solution x. 

The modular method just used is called a sieve procedure, since we can 
imagine passing all integers through a “sieve” for which only those values with 
x mod 3 = 0 come out, then sifting these numbers through another sieve that 
allows only numbers with x mod 5 = 0, 1, or 4 to pass, etc. Each sieve by itself 
will remove about half of the remaining values (see exercise 6); and when we sieve 
with respect to moduli that are relatively prime in pairs, each sieve is independent 
of the others because of the Chinese remainder theorem (Theorem 4.3.2C). So if 
we sieve with respect to, say, 30 different primes, only about one value in every 
$2^{30}$ will need to be examined to see if $x^2 - N$ is a perfect square $y^2$. 

Algorithm D (Factoring with sieves). Given an odd number N, this algorithm 
determines the largest factor of N less than or equal to $\sqrt{N}$. The procedure 
uses moduli $m_1, m_2, \ldots, m_r$ that are relatively prime to each other in pairs and 
relatively prime to N. We assume that r “sieve tables” $S[i, j]$ for $0 \le j < m_i$, 
$1 \le i \le r$, have been prepared, where 

    $S[i, j] = 1$, if $j^2 - N \equiv y^2$ (modulo $m_i$) has a solution y; 
    $S[i, j] = 0$, otherwise. 

D1. [Initialize.] Set $x \leftarrow \lceil\sqrt{N}\rceil$, and set $k_i \leftarrow (-x) \bmod m_i$ for $1 \le i \le r$. 
(Throughout this algorithm the index variables $k_1, k_2, \ldots, k_r$ will be set so 
that $k_i = (-x) \bmod m_i$.) 

D2. [Sieve.] If $S[i, k_i] = 1$ for $1 \le i \le r$, go to step D4. 

D3. [Step x.] Set $x \leftarrow x + 1$, and set $k_i \leftarrow (k_i - 1) \bmod m_i$ for $1 \le i \le r$. 
Return to step D2. 

D4. [Test $x^2 - N$.] Set $y \leftarrow \lfloor\sqrt{x^2 - N}\rfloor$ or to $\lceil\sqrt{x^2 - N}\rceil$. If $y^2 = x^2 - N$, 
then $(x - y)$ is the desired factor, and the algorithm terminates. Otherwise 
return to step D3. | 
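
A direct transcription of steps D1 through D4 may help to fix the roles of the tables and of the indices $k_i$. The Python sketch below is illustrative only: the names `make_sieve_table` and `factor_with_sieves` are our own, and the tables are built on the fly rather than prepared in advance.

```python
from math import isqrt

def make_sieve_table(N, m):
    """S[j] = 1 if j*j - N is congruent to some y*y modulo m, else 0."""
    squares = {(y * y) % m for y in range(m)}
    return [1 if (j * j - N) % m in squares else 0 for j in range(m)]

def factor_with_sieves(N, moduli):
    """Return the largest factor of the odd number N that is <= sqrt(N)."""
    S = [make_sieve_table(N, m) for m in moduli]
    x = isqrt(N - 1) + 1                       # D1: x = ceil(sqrt(N))
    k = [(-x) % m for m in moduli]             #     k_i = (-x) mod m_i
    while True:
        if all(S[i][k[i]] for i in range(len(moduli))):   # D2: every sieve passes
            y = isqrt(x * x - N)                          # D4: test x^2 - N
            if y * y == x * x - N:
                return x - y
        x += 1                                            # D3: step x
        k = [(ki - 1) % m for ki, m in zip(k, moduli)]

print(factor_with_sieves(11111, [3, 5, 7, 8, 11]))
```

Because $(-x)^2 = x^2$, indexing the table by $k_i = (-x) \bmod m_i$ gives the same answer as indexing by $x \bmod m_i$; the negated index simply lets step D3 count downward.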

There are several ways to make this procedure run fast. For example, we 
have seen that if N mod 3 = 2, then x must be a multiple of 3; we can set $x = 3x'$, 
and use a different sieve corresponding to $x'$, increasing the speed threefold. If 
N mod 9 = 1, 4, or 7, then x must be congruent respectively to ±1, ±2, or ±4 
(modulo 9); so we run two sieves (one for $x'$ and one for $x''$, where $x = 9x' + a$ 
and $x = 9x'' - a$) to increase the speed by a factor of $4\frac{1}{2}$. If N mod 4 = 3, 
then x mod 4 is known and the speed is increased by an additional factor of 4; 
in the other case, when N mod 4 = 1, x must be odd so the speed may be 
doubled. Another way to double the speed of the algorithm (at the expense of 
storage space) is to combine pairs of moduli, using $m_{r-k} m_k$ in place of $m_k$, for 
$1 \le k < \frac{1}{2} r$. 

An even more important method of speeding up Algorithm D is to use the 
“Boolean operations” found on most binary computers. Let us assume, for 
example, that MIX is a binary computer with 30 bits per word. The tables $S[i, k_i]$ 
can be kept in memory with one bit per entry; thus 30 values can be stored in a 
single word. The operation AND, which replaces the kth bit of the accumulator 
by zero if the kth bit of a specified word in memory is zero, for $1 \le k \le 30$, can 
be used to process 30 values of x at once! For convenience, we can make several 
copies of the tables $S[i, j]$ so that the table entries for $m_i$ involve $\mathrm{lcm}(m_i, 30)$ 
bits; then the sieve tables for each modulus fill an integral number of words. 
Under these assumptions, 30 executions of the main loop in Algorithm D are 
equivalent to code of the following form: 


    D2  LD1   K1        rI1 ← k'_1. 
        LDA   S1,1      rA ← S'[1, rI1]. 
        DEC1  1         rI1 ← rI1 - 1. 
        J1NN  *+2 
        INC1  M1        If rI1 < 0, set rI1 ← rI1 + lcm(m_1, 30). 
        ST1   K1        k'_1 ← rI1. 
        LD1   K2        rI1 ← k'_2. 
        AND   S2,1      rA ← rA ∧ S'[2, rI1]. 
        DEC1  1         rI1 ← rI1 - 1. 
        J1NN  *+2 
        INC1  M2        If rI1 < 0, set rI1 ← rI1 + lcm(m_2, 30). 
        ST1   K2        k'_2 ← rI1. 
        LD1   K3        rI1 ← k'_3. 
        ...             (m_3 through m_r are like m_2) 
        ST1   Kr        k'_r ← rI1. 
        INCX  30        x ← x + 30. 
        JAZ   D2        Repeat if all sieved out. | 


The number of cycles for 30 iterations is essentially 2 + 8r; if r = 11, this 
means three cycles are being used on each iteration, just as in Algorithm C, and 
Algorithm C involves $y = \frac{1}{2}(v - u)$ more iterations. 
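
The effect of the AND instruction can be imitated in a high-level language by letting one integer hold a whole block of sieve bits. The following Python sketch is an illustration under simplifying assumptions, not a rendering of the MIX program: the name `sieve_block` is our own, each mask is rebuilt for the block instead of being read from a precomputed cyclic table, and Python's unbounded integers stand in for 30-bit words.

```python
def sieve_block(N, moduli, x0, width=30):
    """Return the x in [x0, x0 + width) that survive every sieve.

    Bit i of each mask stands for the candidate x = x0 + i, so a single
    AND per modulus processes the whole block at once.
    """
    survivors = (1 << width) - 1               # start with every candidate alive
    for m in moduli:
        squares = {(y * y) % m for y in range(m)}
        mask = 0
        for i in range(width):
            x = x0 + i
            if (x * x - N) % m in squares:
                mask |= 1 << i
        survivors &= mask                      # the AND step
    return [x0 + i for i in range(width) if (survivors >> i) & 1]

print(sieve_block(11111, [3, 5, 7, 8, 11], 106, width=60))
```

When the result is empty the whole block has been sieved out, which corresponds to the `JAZ D2` branch in the listing.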

If the table entries for $m_i$ do not come out to be an integral number of 
words, further shifting of the table entries would be necessary on each iteration 
in order to align the bits properly. This would add quite a lot of coding to the 
main loop and it would probably make the program too slow to compete with 
Algorithm C unless $v/u \le 100$ (see exercise 7). 

Sieve procedures can be applied to a variety of other problems, not necessarily 
having much to do with arithmetic. A survey of these techniques has been 
prepared by Marvin C. Wunderlich, JACM 14 (1967), 10-19. 

Special sieve machines (of reasonably low cost) have been constructed by 
D. H. Lehmer and his associates over a period of many years; see, for example, 
AMM 40 (1933), 401-406. Lehmer’s electronic delay-line sieve, which began 
operating in 1965, processes one million numbers per second. Thus, each iteration 
of the loop in Algorithm D can be performed in one microsecond on this device. 
Another way to factor with sieves is described by D. H. and Emma Lehmer in 
Math. Comp. 28 (1974), 625-635. 


Primality testing. None of the algorithms we have discussed so far is an efficient 
way to determine that a large number n is prime. Fortunately, there are other 
methods available for settling this question; efficient methods have been devised 
by E. Lucas and others, notably D. H. Lehmer [see Bull. Amer. Math. Soc. 33 
(1927), 327-340]. 

According to Fermat’s theorem (Theorem 1.2.4F), we have $x^{p-1} \bmod p = 1$ 
whenever p is prime and x is not a multiple of p. Furthermore, there are 
efficient ways to calculate $x^{n-1} \bmod n$, requiring only $O(\log n)$ operations of 
multiplication mod n. (We shall study these in Section 4.6.3 below.) Therefore 
we can often determine that n is not prime when this relationship fails. 

For example, Fermat once verified that the numbers $2^1 + 1$, $2^2 + 1$, $2^4 + 1$, 
$2^8 + 1$, and $2^{16} + 1$ are prime. In a letter to Mersenne written in 1640, Fermat 
conjectured that $2^{2^n} + 1$ is always prime, but said he was unable to determine 
definitely whether the number $4294967297 = 2^{32} + 1$ is prime or not. Neither 
Fermat nor Mersenne ever resolved this problem, although they could have done 
it as follows: The number $3^{2^{32}} \bmod (2^{32} + 1)$ can be computed by doing 32 
operations of squaring modulo $2^{32} + 1$, and the answer is 3029026160; therefore 
(by Fermat’s own theorem, which he discovered in the same year 1640!) the 
number $2^{32} + 1$ is not prime. This argument gives us absolutely no idea what 
the factors are, but it answers Fermat’s question. 
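
The calculation that Fermat and Mersenne could have carried out is easy to repeat today. The sketch below uses a hand-written repeated-squaring routine purely for illustration; the names `power_mod` and `fermat_test` are our own, and Python's built-in three-argument `pow` computes the same residues.

```python
def power_mod(x, e, n):
    """Compute x**e mod n by repeated squaring, using O(log e) multiplications mod n."""
    result = 1
    x %= n
    while e > 0:
        if e & 1:
            result = (result * x) % n
        x = (x * x) % n
        e >>= 1
    return result

def fermat_test(n, x=3):
    """Return False if x proves n composite; True only means no contradiction was found."""
    return power_mod(x, n - 1, n) == 1

F5 = 2**32 + 1
residue = power_mod(3, F5 - 1, F5)     # 3^(2^32) mod (2^32 + 1)
print(residue, fermat_test(F5), fermat_test(2**16 + 1))
```

Since the exponent $2^{32}$ is a power of 2, the loop performs exactly the 32 squarings mentioned in the text, plus one final multiplication by the accumulated result.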

Fermat’s theorem is a powerful test for showing non-primality of a given 
number. When n is not prime, it is always possible to find a value of x < n 
such that $x^{n-1} \bmod n \ne 1$; experience shows that, in fact, such a value can 
almost always be found very quickly. There are some rare values of n for which 
$x^{n-1} \bmod n$ is frequently equal to unity, but then n has a factor less than $\sqrt[3]{n}$; 
see exercise 9. 

The same method can be extended to prove that a large prime number n 
really is prime, by using the following idea: If there is a number x for which 
the order of x modulo n is equal to $n - 1$, then n is prime. (The order of x 
modulo n is the smallest positive integer k such that $x^k \bmod n = 1$; see Section 
3.2.1.2.) For this condition implies that the numbers $x^1 \bmod n$, ..., $x^{n-1} \bmod n$ 
are distinct and relatively prime to n, so they must be the numbers 1, 2, ..., 
$n - 1$ in some order; thus n has no proper divisors. If n is prime, such a number x 
(called a “primitive root” of n) will always exist; see exercise 3.2.1.2-16. In fact, 
primitive roots are rather numerous. There are $\varphi(n - 1)$ of them, and this is 
quite a substantial number, since $n/\varphi(n - 1) = O(\log\log n)$. 

It is unnecessary to calculate $x^k \bmod n$ for all $k \le n - 1$ to determine if the 
order of x is $n - 1$ or not. The order of x will be $n - 1$ if and only if 

i) $x^{n-1} \bmod n = 1$; 

ii) $x^{(n-1)/p} \bmod n \ne 1$ for all primes p that divide $n - 1$. 

For $x^s \bmod n = 1$ if and only if s is a multiple of the order of x modulo n. If 
the two conditions hold, and if k is the order of x modulo n, we therefore know 
that k is a divisor of $n - 1$, but not a divisor of $(n - 1)/p$ for any prime factor p 
of $n - 1$; the only remaining possibility is $k = n - 1$. This completes the proof 
that conditions (i) and (ii) suffice to establish the primality of n. 

Exercise 10 shows that we can in fact use different values of x for each of 
the primes p, and n will still be prime. We may restrict consideration to prime 
values of x, since the order of uv modulo n divides the least common multiple 
of the orders of u and v by exercise 3.2.1.2-15. Conditions (i) and (ii) can be 
tested efficiently by using the rapid methods for evaluating powers of numbers 
discussed in Section 4.6.3. But it is necessary to know the prime factors of $n - 1$, 
so we have an interesting situation in which the factorization of n depends on 
that of $n - 1$. 
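
Conditions (i) and (ii), together with the freedom to use a different x for each prime p, lead to a short test once the prime factors of $n - 1$ are known. The following Python sketch is one possible arrangement; the function name and the search limit `max_x` are our own assumptions, and for simplicity it tries all small x rather than only prime values.

```python
def prove_prime_from_n_minus_1(n, prime_factors, max_x=100):
    """Try to prove n prime from the distinct prime factors of n - 1.

    For every prime p dividing n - 1 we look for some x with
    x^(n-1) mod n == 1 and x^((n-1)/p) mod n != 1.
    Returns True (proved prime), False (proved composite), or None (undecided).
    """
    rest = n - 1
    for p in prime_factors:
        if rest % p != 0:
            raise ValueError("every listed prime must divide n - 1")
        while rest % p == 0:
            rest //= p
    if rest != 1:
        raise ValueError("prime_factors must account for all of n - 1")
    for p in prime_factors:
        for x in range(2, min(max_x, n)):
            if pow(x, n - 1, n) != 1:
                return False             # Fermat's theorem fails, so n is composite
            if pow(x, (n - 1) // p, n) != 1:
                break                    # this x settles the prime p
        else:
            return None                  # no suitable x was found below max_x
    return True

print(prove_prime_from_n_minus_1(1653701519, [2, 7, 19, 23, 137, 1973]))
```

A result of `None` does not say anything about n; it only means that the search limit was too small for some p.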

An example. The study of a reasonably typical large factorization will help to 
fix the ideas we have discussed so far. Let us try to find the prime factors of 
$2^{214} + 1$, a 65-digit number. The factorization can be initiated with a bit of 
clairvoyance if we notice that 

    $2^{214} + 1 = (2^{107} - 2^{54} + 1)(2^{107} + 2^{54} + 1);$    (14) 

this identity is a special case of some factorizations discovered by A. Aurifeuille 
in 1873 [see Dickson’s History, 1, p. 383]. The problem now boils down to 
examining each of the 33-digit factors in (14). 

A computer program readily discovers that $2^{107} - 2^{54} + 1 = 5 \cdot 857 \cdot n_0$, 
where 

    $n_0 = 37866809061660057264219253397$    (15) 

is a 29-digit number having no prime factors less than 1000. A multiple-precision 
calculation using the “binary method” of Section 4.6.3 shows that 

    $3^{n_0 - 1} \bmod n_0 = 1,$ 

so we suspect that $n_0$ is prime. It is certainly out of the question to prove that 
$n_0$ is prime by trying the 10 million million or so potential divisors, but the 
method discussed above gives a feasible test for primality: our next goal is to 
factor $n_0 - 1$. With little difficulty, our computer will tell us that 

    $n_0 - 1 = 2 \cdot 2 \cdot 19 \cdot 107 \cdot 353 \cdot n_1, \qquad n_1 = 13191270754108226049301.$ 

Here $3^{n_1 - 1} \bmod n_1 \ne 1$, so $n_1$ is not prime; by continuing Algorithm A or 
Algorithm B we find 

    $n_1 = 91813 \cdot n_2, \qquad n_2 = 143675413657196977.$ 

This time $3^{n_2 - 1} \bmod n_2 = 1$, so we will try to prove that $n_2$ is prime. This 
requires the factorization $n_2 - 1 = 2 \cdot 2 \cdot 2 \cdot 2 \cdot 3 \cdot 3 \cdot 547 \cdot n_3$, where $n_3 = 1824032775457$. 
Since $3^{n_3 - 1} \bmod n_3 \ne 1$, we know that $n_3$ is composite, and 
Algorithm A finds that $n_3 = 1103 \cdot n_4$, where $n_4 = 1653701519$. The number 
$n_4$ behaves like a prime (i.e., $3^{n_4 - 1} \bmod n_4 = 1$), so we calculate 

    $n_4 - 1 = 2 \cdot 7 \cdot 19 \cdot 23 \cdot 137 \cdot 1973.$ 

Good; this is our first complete factorization. We are now ready to backtrack 
to the previous subproblem, proving that $n_4$ is prime. Using the procedure 
suggested by exercise 10, we compute the following values: 

    x    p       $x^{(n_4-1)/p} \bmod n_4$     $x^{n_4-1} \bmod n_4$ 
    2    2       1                           (1) 
    2    7       766408626                   (1) 
    2    19      332952683                   (1) 
    2    23      1154237810                  (1) 
    2    137     373782186                   (1)          (16) 
    2    1973    490790919                   (1) 
    3    2       1                           (1) 
    5    2       1                           (1) 
    7    2       1653701518                  1 

(Here “(1)” means a result of 1 that needn’t be computed since it can be deduced 
from previous calculations.) Thus $n_4$ is prime, and $n_2 - 1$ has been completely 
factored. A similar calculation shows that $n_2$ is prime, and this complete 
factorization of $n_0 - 1$ finally shows [after still another calculation like (16)] that 
$n_0$ is prime. 
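
Every equation in this chain of subproblems is a plain integer identity, so the bookkeeping can be checked with a few lines of multiple-precision arithmetic. The Python sketch below does only that; the dictionary names are our own, and the residues $3^{n-1} \bmod n$ are computed with the built-in `pow` rather than with the programs the example refers to.

```python
n0 = 37866809061660057264219253397
n1 = 13191270754108226049301
n2 = 143675413657196977
n3 = 1824032775457
n4 = 1653701519

# Each entry restates one equation from the example as an exact integer check.
checks = {
    "identity (14)": (2**107 - 2**54 + 1) * (2**107 + 2**54 + 1) == 2**214 + 1,
    "first factor": 2**107 - 2**54 + 1 == 5 * 857 * n0,
    "n0 - 1": n0 - 1 == 2 * 2 * 19 * 107 * 353 * n1,
    "n1": n1 == 91813 * n2,
    "n2 - 1": n2 - 1 == 2**4 * 3**2 * 547 * n3,
    "n3": n3 == 1103 * n4,
    "n4 - 1": n4 - 1 == 2 * 7 * 19 * 23 * 137 * 1973,
}

# True means 3^(n-1) mod n = 1 ("behaves like a prime");
# False proves that n is composite.
fermat_passes = {name: pow(3, n - 1, n) == 1
                 for name, n in [("n0", n0), ("n1", n1), ("n2", n2),
                                 ("n3", n3), ("n4", n4)]}

for name, ok in checks.items():
    print(name, ok)
print(fermat_passes)
```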




The last three lines of (16) represent a search for an integer x that satisfies 
$x^{(n_4-1)/2} \not\equiv 1$ (modulo $n_4$). If $n_4$ is prime, we have only a 50-50 chance 
of success, so the case p = 2 is typically the hardest one to verify. We could 
streamline this part of the calculation by using the law of quadratic reciprocity 
(cf. exercise 23), which tells us for example that $5^{(q-1)/2} \equiv 1$ (modulo q) 
whenever q is a prime congruent to ±1 (modulo 5). Merely calculating $n_4 \bmod 5$ 
would have told us right away that x = 5 could not possibly help in showing 
that $n_4$ is prime. In fact, however, the result of exercise 26 implies that the case 
p = 2 doesn’t really need to be considered at all when testing n for primality, 
unless $n - 1$ is divisible by a high power of 2, so we could have dispensed with 
the last three lines of (16) entirely. 

The next quantity to be factored is the other half of (14), namely 

    $n_5 = 2^{107} + 2^{54} + 1.$ 

Since $3^{n_5 - 1} \bmod n_5 \ne 1$, we know that $n_5$ is not prime, and Algorithm B 
shows that $n_5 = 843589 \cdot n_6$, where $n_6 = 192343993140277293096491917$. 
Unfortunately, $3^{n_6 - 1} \bmod n_6 \ne 1$, so we are left with a 27-digit nonprime. Continuing 
Algorithm B might well exhaust our patience (not our budget; nobody 
is paying for this, we’re using idle time on a weekend rather than “prime time”). 
But the sieve method of Algorithm D will be able to crack $n_6$ into its two factors, 

    $n_6 = 8174912477117 \cdot 23528569104401.$ 

This result could not have been discovered by Algorithm A in a reasonable length 
of time. (A few million iterations of Algorithm B would probably have sufficed.) 

Now the computation is complete: $2^{214} + 1$ has the prime factorization 

    $5 \cdot 857 \cdot 843589 \cdot 8174912477117 \cdot 23528569104401 \cdot n_0,$ 

where $n_0$ is the 29-digit prime in (15). A certain amount of good fortune entered 
into these calculations, for if we had not started with the known factorization 
(14) it is quite probable that we would first have cast out the small factors, 
reducing n to $n_6 n_0$. This 55-digit number would have been much more difficult 
to factor: Algorithm D would be useless and Algorithm B would have to work 
overtime because of the high precision necessary. 

Dozens of further numerical examples can be found in an article by John 
Brillhart and J. L. Selfridge, Math. Comp. 21 (1967), 87-96. 

Improved primality tests. Since the above procedure for proving that n is prime 
requires the complete factorization of $n - 1$, it will bog down for large n. 
Another technique, which uses the factorization of $n + 1$ instead, is described in 
exercise 15; if $n - 1$ turns out to be too hard, $n + 1$ might be easier. 

Significant improvements are available for dealing with large n. For example, 
it is not difficult to prove a stronger converse of Fermat’s theorem that requires 
only a partial factorization of $n - 1$. Exercise 26 shows that we could have 
avoided most of the calculations in (16); the three conditions 
$2^{n_4-1} \bmod n_4 = \gcd(2^{(n_4-1)/23} - 1, n_4) = \gcd(2^{(n_4-1)/1973} - 1, n_4) = 1$ 
are sufficient by themselves 
to prove that $n_4$ is prime. Brillhart, Lehmer, and Selfridge have in fact 
developed a method that works when the numbers $n - 1$ and $n + 1$ have been 
only partially factored [Math. Comp. 29 (1975), 620-647, Corollary 11]: Suppose 
$n - 1 = f^- r^-$ and $n + 1 = f^+ r^+$, where we know the complete factorizations 
of $f^-$ and $f^+$, and we also know that all factors of $r^-$ and $r^+$ are $\ge b$. If the 
product $b^3 f^- f^+ \max(f^-, f^+)$ is greater than $2n$, a small amount of additional 
computation, described in their paper, will determine whether or not n is prime. 
Therefore numbers of up to 35 digits can usually be tested for primality in 2 or 
3 seconds, simply by casting out all prime factors < 30030 from $n \pm 1$ [see J. L. 
Selfridge and M. C. Wunderlich, Proc. Fourth Manitoba Conf. Numer. Math. 
(1974), 109-120]. The partial factorization of other quantities like $n^2 \pm n + 1$ 
and $n^2 + 1$ can be used to improve this method still further [see H. C. Williams 
and J. S. Judd, Math. Comp. 30 (1976), 157-172, 867-886]. 
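
The three conditions quoted above for $n_4$ are cheap to evaluate. The sketch below merely computes them; it does not reproduce the argument of exercise 26 that makes them sufficient, and the function name `partial_factor_conditions` is our own.

```python
from math import gcd

def partial_factor_conditions(n, x, primes):
    """Check x^(n-1) mod n == 1 and gcd(x^((n-1)/p) - 1, n) == 1 for each listed p."""
    if pow(x, n - 1, n) != 1:
        return False
    return all(gcd(pow(x, (n - 1) // p, n) - 1, n) == 1 for p in primes)

n4 = 1653701519
print(partial_factor_conditions(n4, 2, [23, 1973]))
```

Only the primes 23 and 1973 are used here, so four of the six lines computed for x = 2 in (16) are not needed.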

In practice, when n has no small prime factors and $3^{n-1} \bmod n = 1$, it has 
almost always turned out that n is prime. (One of the rare exceptions in the 
author’s experience is $n = \frac{1}{7}(2^{28} - 9) = 2341 \cdot 16381$.) On the other hand, 
some nonprime values of n are definitely bad news for the primality test we have 
discussed, because it might happen that $x^{n-1} \bmod n = 1$ for all x relatively 
prime to n (see exercise 9). One such number is $n = 3 \cdot 11 \cdot 17 = 561$; here 
$\lambda(n) = \mathrm{lcm}(2, 10, 16) = 80$ in the notation of Eq. 3.2.1.2-9, so $x^{80} \bmod 561 = 
1 = x^{560} \bmod 561$ whenever x is relatively prime to 561. Our procedure would 
repeatedly fail to show that such an n is nonprime, until we had stumbled across 
one of its divisors. To improve the method, we need a quick way to determine 
the nonprimality of nonprime n, even in such pathological cases. 
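
The behavior of 561 is small enough to confirm exhaustively. The following sketch (with a helper name, `fermat_liars`, of our own choosing) lists every x below n for which the Fermat test is fooled and compares that list with the residues relatively prime to n.

```python
from math import gcd

def fermat_liars(n):
    """All x in [1, n) with x^(n-1) mod n == 1."""
    return [x for x in range(1, n) if pow(x, n - 1, n) == 1]

n = 3 * 11 * 17
coprime = [x for x in range(1, n) if gcd(x, n) == 1]

print(n, len(coprime), fermat_liars(n) == coprime)
print(all(pow(x, 80, n) == 1 for x in coprime))
```

The test can expose 561 only through an x that shares a factor with it, which is exactly the "stumbling across a divisor" described above.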

The following surprisingly simple procedure is guaranteed to do the job with 
high probability: 

Algorithm P (Probabilistic primality test). Given an odd integer n, this algorithm 
attempts to decide whether or not n is prime. By repeating the algorithm 
several times, as explained in the remarks below, it is possible to be extremely 
confident about the primality of n, in a precise sense, yet the primality will not 
be rigorously proved. Let $n = 1 + 2^k q$, where q is odd. 

P1. [Generate x.] Let x be a random integer in the range 1 < x < n. 

P2. [Exponentiate.] Set $j \leftarrow 0$ and $y \leftarrow x^q \bmod n$. (As in our previous primality 
test, $x^q \bmod n$ should be calculated in $O(\log q)$ steps, cf. Section 4.6.3.) 

P3. [Done?] (Now $y = x^{2^j q} \bmod n$.) If j = 0 and y = 1, or if $y = n - 1$, 
terminate the algorithm and say “n is probably prime.” If j > 0 and y = 1, 
go to step P5. 

P4. [Increase j.] Increase j by 1. If j < k, set $y \leftarrow y^2 \bmod n$ and return to step 
P3. 

P5. [Not prime.] Terminate the algorithm and say that “n is definitely not 
prime.” | 
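
Steps P1 through P5 can be written out as follows. This Python sketch is an illustration only: the names `algorithm_p` and `probably_prime` are our own, the optional argument `x` is added so that a particular trial can be repeated, and the random numbers come from Python's general-purpose generator.

```python
import random

def algorithm_p(n, x=None):
    """One trial: True means 'probably prime', False means 'definitely not prime'."""
    if n < 3 or n % 2 == 0:
        raise ValueError("n must be an odd integer >= 3")
    k, q = 0, n - 1
    while q % 2 == 0:                    # write n = 1 + 2^k * q with q odd
        k += 1
        q //= 2
    if x is None:
        x = random.randrange(2, n)       # P1: 1 < x < n
    j, y = 0, pow(x, q, n)               # P2
    while True:
        if (j == 0 and y == 1) or y == n - 1:    # P3
            return True
        if j > 0 and y == 1:
            return False                 # P5
        j += 1                           # P4
        if j >= k:
            return False                 # P5
        y = (y * y) % n

def probably_prime(n, trials=25):
    """Repeat the trial with independent random x; False is always a definite verdict."""
    return all(algorithm_p(n) for _ in range(trials))

print(algorithm_p(561, x=2), probably_prime(1653701519))
```

With n = 561 and x = 2 the values of y are 263, 166, 67, 1, so the value preceding the first 1 is not n - 1 and the trial reports that 561 is definitely not prime.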

The idea underlying Algorithm P is that if $n = 1 + 2^k q$ is prime and 
$x^q \bmod n \ne 1$, the sequence of values 

    $x^q \bmod n,\ x^{2q} \bmod n,\ x^{4q} \bmod n,\ \ldots,\ x^{2^k q} \bmod n$ 

will end with 1, and the value just preceding the first appearance of 1 will be 
$n - 1$. (The only solutions to $y^2 \equiv 1$ (modulo p) are $y \equiv \pm 1$, when p is prime, 
since $(y - 1)(y + 1)$ must be a multiple of p.) 

Exercise 22 proves the basic fact that Algorithm P will be wrong at most 
$\frac{1}{4}$ of the time, for all n. Actually it will rarely fail at all, for most n; but the 
crucial point is that the probability of failure is bounded regardless of the value 
of n. 

Suppose we invoke Algorithm P repeatedly, choosing x independently and 
at random whenever we get to step P1. If the algorithm ever reports that n is 
nonprime, we can say that n definitely isn’t prime. But if the algorithm reports 
25 times in a row that n is “probably prime,” we can say that n is “almost surely 
prime.” For the probability is less than $(1/4)^{25}$ that such a 25-times-in-a-row 
procedure gives the wrong information about n. This is less than one chance in a 
quadrillion; even if we certified a billion different primes with such a procedure, 
the expected number of mistakes would be less than $\frac{1}{1000000}$. It’s much more 
likely that our computer has dropped a bit in its calculations, due to hardware 
malfunctions or cosmic radiations, than that Algorithm P has repeatedly guessed 
wrong! 
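
The two numerical claims in this paragraph follow from $4^{25} = 2^{50}$, and they can be confirmed with exact rational arithmetic; the variable names below are our own.

```python
from fractions import Fraction

per_run = Fraction(1, 4) ** 25            # bound on one wrong 25-in-a-row verdict
expected_mistakes = 10**9 * per_run       # bound for a billion certified numbers

print(per_run < Fraction(1, 10**15), expected_mistakes < Fraction(1, 10**6))
```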

Probabilistic algorithms like this lead us to question our traditional standards 
of reliability. Do we really need to have a rigorous proof of primality? 
For people unwilling to abandon traditional notions of proof, Gary L. Miller 
has demonstrated that if $\sqrt[r]{n}$ is not an integer for any integer $r \ge 2$ (this condition 
being easily checked), and if a certain well-known conjecture in number 
theory called the Generalized Riemann Hypothesis can be proved, then either 
n is prime or there is an $x < 4(\ln n)^2$ such that Algorithm P will discover the 
nonprimality of n. [See J. Comp. System Sci. 13 (1976), 300-317. The constant 4 
in this upper bound is due to Peter Weinberger, whose paper on the subject is 
not yet published.] Thus, we would have a rigorous way to test primality in 
$O(\log n)^5$ elementary operations, as opposed to a probabilistic method whose 
running time is $O(\log n)^3$. But one might well ask whether any purported proof 
of the Generalized Riemann Hypothesis will ever be as reliable as repeated application 
of Algorithm P on random x’s. 

A probabilistic test for primality was first proposed in 1974 by R. Solovay 
and V. Strassen, who devised the interesting but more complicated test described 
in exercise 23(b). [See SIAM J. Computing 6 (1977), 84-85; 7 (1978), 118.] 
Algorithm P is a simplified version of a procedure due to M. O. Rabin, based 
in part on ideas of Gary L. Miller [cf. Algorithms and Complexity, ed. by J. F. 
Traub (New York: Academic Press, 1976), 35-36]. 

A completely different approach to primality testing was discovered in 1980 
by Leonard M. Adleman. His highly interesting method is based on the theory 
of algebraic integers, so it is beyond the scope of this book; but it leads to a 
non-probabilistic procedure that will decide the primality of any number of up 
to, say, 250 digits, in a few hours at most. [See L. M. Adleman and R. S. 
Rumely, to appear.] 

Factoring via continued fractions. The factorization procedures we have discussed 
so far will often balk at numbers of 30 digits or more, and another idea is 
needed if we are to go much further. Fortunately there is such an idea; in fact, 
there were two ideas, due respectively to A. M. Legendre and M. Kraitchik, that 
D. H. Lehmer and R. E. Powers used to devise a new technique many years ago 
[Bull. Amer. Math. Soc. 37 (1931), 770-776]. However, the method was not used 
at the time because it was comparatively unsuitable for desk calculators. This 
negative judgment prevailed until the late 1960s, when John Brillhart found that 
the Lehmer-Powers approach deserved to be resurrected, since it was quite well 
suited to computer programming. In fact, he and Michael A. Morrison later developed 
it into the champion of all known methods for factoring large numbers: 
Their program would handle typical 25-digit numbers in about 30 seconds, and 
40-digit numbers in about 50 minutes, on an IBM 360/91 computer [see Math. 
Comp. 29 (1975), 183-205]. In 1970 the method had its first triumphant success, 
discovering that $2^{128} + 1 = 59649589127497217 \cdot 5704689200685129054721$. 

The basic idea is to search for numbers x and y such that 

    $x^2 \equiv y^2$ (modulo N), $0 < x, y < N$, $x \ne y$, $x + y \ne N$.    (17) 

Fermat’s method imposes the stronger requirement $x^2 - y^2 = N$, but actually 
the congruence (17) is enough to split N into factors: It implies that N is a 
divisor of $x^2 - y^2 = (x - y)(x + y)$, yet N divides neither $x - y$ nor $x + y$; hence 
$\gcd(N, x - y)$ and $\gcd(N, x + y)$ are proper factors of N that can be found by 
the efficient methods of Section 4.5.2. 
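
In program form the splitting step is only a pair of gcd computations. The sketch below is an illustration with a function name of our own; it returns `None` for the trivial cases excluded by (17).

```python
from math import gcd

def split_with_congruent_squares(N, x, y):
    """Given x^2 = y^2 (modulo N), return two proper factors of N, or None if trivial."""
    if (x * x - y * y) % N != 0:
        raise ValueError("x^2 and y^2 are not congruent modulo N")
    if x % N == y % N or (x + y) % N == 0:
        return None                      # x = y or x = -y (modulo N): no information
    return gcd(N, x - y), gcd(N, x + y)

print(split_with_congruent_squares(197209, 126308, 540))
```

The sample values are the ones obtained for N = 197209 in the worked example of this section.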

One way to discover solutions of (17) is to look for values of x such that 
$x^2 \equiv a$ (modulo N), for small values of $|a|$. As we will see, it is often a simple 
matter to piece together solutions of this congruence to obtain solutions of (17). 
Now if $x^2 = a + kNd^2$ for some k and d, with small $|a|$, the fraction $x/d$ is a good 
approximation to $\sqrt{kN}$; conversely, if $x/d$ is an especially good approximation 
to $\sqrt{kN}$, the difference $|x^2 - kNd^2|$ will be small. This observation suggests 
looking at the continued fraction expansion of $\sqrt{kN}$, since we have seen (in 
Eq. 4.5.3-12 and exercise 4.5.3-42) that continued fractions yield good rational 
approximations. 

Continued fractions for quadratic irrationalities have many pleasant properties, 
which are proved in exercise 4.5.3-12. The algorithm below makes use of 
these properties to derive solutions to the congruence 

    $x^2 \equiv (-1)^{e_0} p_1^{e_1} p_2^{e_2} \cdots p_m^{e_m}$ (modulo N).    (18) 

Here we use a fixed set of small primes $p_1 = 2$, $p_2 = 3$, ..., up to $p_m$; only 
primes p such that either p = 2 or $(kN)^{(p-1)/2} \bmod p \le 1$ should appear in this 
list, since other primes will never be factors of the numbers generated by the 
algorithm (see exercise 14). If $(x_1, e_{01}, e_{11}, \ldots, e_{m1})$, ..., $(x_r, e_{0r}, e_{1r}, \ldots, e_{mr})$ 
are solutions of (18) such that the vector sum 

    $(e_{01}, e_{11}, \ldots, e_{m1}) + \cdots + (e_{0r}, e_{1r}, \ldots, e_{mr}) = (2e'_0, 2e'_1, \ldots, 2e'_m)$    (19) 

is even in each component, then 

    $x = (x_1 \ldots x_r) \bmod N, \qquad y = \bigl((-1)^{e'_0} p_1^{e'_1} \cdots p_m^{e'_m}\bigr) \bmod N$    (20) 

yields a solution to (17), except for the possibility that $x \equiv \pm y$. Condition (19) 
essentially says that the vectors are linearly dependent modulo 2, so we must 
have a solution to (19) if we have found at least $m + 2$ solutions to (18). 

Algorithm E (Factoring via continued fractions). Given a positive integer N 
and a positive integer k such that kN is not a perfect square, this algorithm 
attempts to discover solutions to the congruence (18) for fixed m, by analyzing 
the convergents of the continued fraction for $\sqrt{kN}$. (Another algorithm, which 
uses the outputs to discover factors of N, is the subject of exercise 12.) 

E1. [Initialize.] Set $D \leftarrow kN$, $R \leftarrow \lfloor\sqrt{D}\rfloor$, $R' \leftarrow 2R$, $U \leftarrow U' \leftarrow R'$, $V \leftarrow 1$, 
$V' \leftarrow D - R^2$, $P \leftarrow R$, $P' \leftarrow 1$, $A \leftarrow 0$, $S \leftarrow 0$. (This algorithm 
follows the general procedure of exercise 4.5.3-12, finding the continued 
fraction expansion of $\sqrt{kN}$. The variables $U$, $U'$, $V$, $V'$, $P$, $P'$, $A$, and $S$ 
represent, respectively, what that exercise calls $R + U_n$, $R + U_{n-1}$, $V_n$, 
$V_{n-1}$, $p_n \bmod N$, $p_{n-1} \bmod N$, $A_n$, and $n \bmod 2$. We will always have 

    $0 < V \le U \le R'$, 

so the highest precision is needed only for $P$ and $P'$.) 

E2. [Advance U, V, S.] Set $T \leftarrow V$, $V \leftarrow A(U' - U) + V'$, $V' \leftarrow T$, $A \leftarrow \lfloor U/V \rfloor$, 
$U' \leftarrow U$, $U \leftarrow R' - (U \bmod V)$, $S \leftarrow 1 - S$. 

E3. [Factor V.] (Now we have $P^2 - kNQ^2 = (-1)^S V$, for some Q relatively 
prime to P, by exercise 4.5.3-12(c).) Set $(e_0, e_1, \ldots, e_m) \leftarrow (S, 0, \ldots, 0)$, 
$T \leftarrow V$. Now do the following, for $1 \le j \le m$: If $T \bmod p_j = 0$, set 
$T \leftarrow T/p_j$ and $e_j \leftarrow e_j + 1$, and repeat this process until $T \bmod p_j \ne 0$. 

E4. [Solution?] If T = 1, output the values $(P, e_0, e_1, \ldots, e_m)$, which comprise a 
solution to (18). (If enough solutions have been generated, we may terminate 
the algorithm now.) 

E5. [Advance P, P'.] If $V \ne 1$, set $T \leftarrow P$, $P \leftarrow (AP + P') \bmod N$, $P' \leftarrow T$, 
and return to step E2. Otherwise the continued fraction process has started 
to repeat its cycle, except perhaps for S, so the algorithm terminates. (The 
cycle will usually be so long that this doesn’t happen.) | 
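
Algorithm E can be followed step by step in a few lines of code. The Python sketch below is an illustration rather than a tuned implementation: the function name and the iteration limit `max_iterations` are our own additions, and each output is returned as a pair consisting of P and the exponent tuple $(e_0, e_1, \ldots, e_m)$.

```python
from math import isqrt

def algorithm_e(N, k, primes, max_iterations):
    """Run steps E1-E5 and collect the outputs (P, (e0, e1, ..., em))."""
    D = k * N                                    # E1
    R = isqrt(D)
    if R * R == D:
        raise ValueError("kN must not be a perfect square")
    Rp = 2 * R
    U = Up = Rp
    V, Vp = 1, D - R * R
    P, Pp = R, 1
    A, S = 0, 0
    outputs = []
    for _ in range(max_iterations):
        V, Vp = A * (Up - U) + Vp, V             # E2: the old V becomes V'
        A = U // V
        Up, U = U, Rp - (U % V)
        S = 1 - S
        e = [S] + [0] * len(primes)              # E3: cast out the small primes
        T = V
        for j, p in enumerate(primes, start=1):
            while T % p == 0:
                T //= p
                e[j] += 1
        if T == 1:                               # E4: V factored completely
            outputs.append((P, tuple(e)))
        if V == 1:                               # E5: the cycle is about to repeat
            break
        P, Pp = (A * P + Pp) % N, P
    return outputs

for P, e in algorithm_e(197209, 1, [2, 3, 5], 13):
    print(P, e)
```

Each output (P, e) satisfies $P^2 \equiv (-1)^{e_0} 2^{e_1} 3^{e_2} 5^{e_3}$ (modulo N) for the sample input; thirteen iterations give the three outputs shown in the worked example for N = 197209.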

We can illustrate the application of Algorithm E to relatively small numbers 
by considering the case N = 197209, k = 1, m = 3, $p_1 = 2$, $p_2 = 3$, $p_3 = 5$. 
The computation proceeds as follows: 

                U     V     A     P         S    T      Output 
    After E1    888   1     0     444       0    - 
    After E4    876   73    12    444       1    73 
    After E4    882   145   6     5329      0    29 
    After E4    857   37    23    32418     1    37 
    After E4    751   720   1     159316    0    1      $159316^2 \equiv +2^4 \cdot 3^2 \cdot 5^1$ 
    After E4    852   143   5     191734    1    143 
    After E4    681   215   3     131941    0    43 
    After E4    863   656   1     193139    1    41 
    After E4    883   33    26    127871    0    11 
    After E4    821   136   6     165232    1    17 
    After E4    877   405   2     133218    0    1      $133218^2 \equiv +2^0 \cdot 3^4 \cdot 5^1$ 
    After E4    875   24    36    37250     1    1      $37250^2 \equiv -2^3 \cdot 3^1 \cdot 5^0$ 
    After E4    490   477   1     93755     0    53 


Continuing the computation gives 25 outputs in the first 100 iterations; in other
words, the algorithm is finding solutions quite rapidly. But some of the solutions
are trivial. For example, if the above computation were continued 13 more times,
we would obtain the output $197197^2 \equiv 2^4 \cdot 3^2 \cdot 5^0$, which is of no interest since
$197197 \equiv -12$. The first two solutions above are already enough to complete
the factorization: We have found that

    $(159316 \cdot 133218)^2 \equiv (2^2 \cdot 3^3 \cdot 5^1)^2$ (modulo 197209);

thus (17) holds with $x = (159316 \cdot 133218) \bmod 197209 = 126308$, $y = 540$. By
Euclid's algorithm, $\gcd(126308 - 540, 197209) = 199$; hence we obtain the pretty
factorization

    $197209 = 199 \cdot 991$.
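
The steps of Algorithm E can be sketched in Python as follows. Step E1 is not reproduced in this part of the text, so the initialization here is an assumption chosen to agree with the "After E1" row of the table ($R = \lfloor\sqrt{kN}\rfloor$, $R' = 2R$, $V' = kN - R^2$). The names `algorithm_e` and `max_iter` are illustrative only, and the sketch assumes that $kN$ is not a perfect square.

```python
from math import gcd, isqrt, prod

def algorithm_e(N, primes, k=1, max_iter=100):
    """Return the outputs (P, [e0, e1, ..., em]) of steps E2-E5."""
    # E1 (assumed initialization, matching the "After E1" table row).
    D = k * N
    R = isqrt(D)
    Rp = 2 * R
    U = Up = Rp
    V, Vp = 1, D - R * R
    P, Pp = R, 1
    A, S = 0, 0
    outputs = []
    for _ in range(max_iter):
        # E2: advance U, V, S.
        V, Vp = A * (Up - U) + Vp, V
        A = U // V
        Up, U = U, Rp - (U % V)
        S = 1 - S
        # E3: cast out the small primes from V.
        e = [S] + [0] * len(primes)
        T = V
        for j, p in enumerate(primes, start=1):
            while T % p == 0:
                T //= p
                e[j] += 1
        # E4: V factored completely, so P^2 = (-1)^S * V (modulo N).
        if T == 1:
            outputs.append((P, e))
        # E5: advance P, P', or stop when the cycle repeats.
        if V == 1:
            break
        P, Pp = (A * P + Pp) % N, P
    return outputs

N, primes = 197209, [2, 3, 5]
outs = algorithm_e(N, primes, max_iter=12)
(x1, e1), (x2, e2) = outs[0], outs[1]
x = x1 * x2 % N
# Both signs e0 are 0 here, so y is the product of the halved exponents.
y = prod(p ** ((a + b) // 2) for p, a, b in zip(primes, e1[1:], e2[1:]))
print(outs, x, y, gcd(x - y, N))
```

Twelve iterations correspond to the twelve "After E4" rows of the table, and combining the first two outputs reproduces the factor 199 found in the text.
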

We can get some understanding of why Algorithm E factors large numbers
so successfully by considering a heuristic analysis of its running time, following
unpublished ideas that R. Schroeppel communicated to the author in 1975. Let
us assume for convenience that $k = 1$. The number of outputs needed to produce
a factorization of $N$ will be roughly proportional to the number $m$ of small primes
being cast out. Each execution of step E3 takes about order $m \log N$ units of
time, so the total running time will be roughly proportional to $m^2 \log N / P$,
where $P$ is the probability of a successful output per iteration. If we make the
conservative assumption that $V$ is randomly distributed between 0 and $2\sqrt{N}$, the
probability $P$ is $(2\sqrt{N})^{-1}$ times the number of integers $< 2\sqrt{N}$ whose prime
factors are all in the set $\{p_1, \ldots, p_m\}$. Exercise 29 gives a lower bound for $P$,
from which we conclude that the running time is at most of order

    $$\frac{2\sqrt{N}\, m^2 \log N}{m^r / r!}, \qquad \text{where } r = \left\lfloor \frac{\log 2\sqrt{N}}{\log p_m} \right\rfloor. \tag{21}$$

If we let $\ln m = \frac12 \sqrt{\ln N \ln\ln N}$, we find that $r = \sqrt{\ln N / \ln\ln N} - 1 +
O(\log\log\log N / \log\log N)$, assuming that $p_m = O(m \log m)$, so formula (21)
reduces to

    $$\exp\bigl(2\sqrt{(\ln N)(\ln\ln N)} + O((\log N)^{1/2} (\log\log N)^{-1/2} (\log\log\log N))\bigr).$$

Stating this another way, the running time of Algorithm E is expected to be
at most $N^{\epsilon(N)}$ under reasonably plausible assumptions, where the exponent
$\epsilon(N) \approx 2\sqrt{\ln\ln N / \ln N}$ goes to 0 as $N \to \infty$.

When $N$ is in a practical range, we should of course be careful not to take
such asymptotic estimates too seriously. For example, if $N = 10^{50}$ we have
$N^{1/\alpha} = (\lg N)^\alpha$ when $\alpha \approx 4.75$, and the same relation holds for $\alpha \approx 8.42$ when
$N = 10^{200}$. The function $N^{\epsilon(N)}$ has an order of growth that is sort of a cross
between $N^{1/\alpha}$ and $(\lg N)^\alpha$; but all three of these forms are about the same,
unless $N$ is intolerably large. Extensive computational experiments by M. C.
Wunderlich have shown that a well-tuned version of Algorithm E performs much
better than our estimate would indicate [cf. Lecture Notes in Math. 751 (1979),
328-342]; although $2\sqrt{\ln\ln N / \ln N} \approx .41$ when $N = 10^{50}$, he obtained running
times of about $N^{.15}$ while factoring thousands of numbers in the range $10^{13} \le
N \le 10^{42}$.
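
The numerical values quoted above can be checked with a few lines of Python. This is only a sketch: the function names are made up for the illustration, and $N$ is represented by its number of decimal digits so that $\ln N = d \ln 10$. The crossover value follows from taking logarithms of $N^{1/\alpha} = (\lg N)^\alpha$, which gives $\alpha^2 = \ln N / \ln \lg N$.

```python
from math import log, sqrt

def epsilon(digits):
    """The exponent 2*sqrt(ln ln N / ln N) for N = 10**digits."""
    ln_n = digits * log(10)
    return 2 * sqrt(log(ln_n) / ln_n)

def crossover_alpha(digits):
    """The alpha for which N**(1/alpha) equals (lg N)**alpha, N = 10**digits."""
    ln_n = digits * log(10)
    lg_n = ln_n / log(2)
    return sqrt(ln_n / log(lg_n))

for d in (50, 200):
    print(d, round(epsilon(d), 3), round(crossover_alpha(d), 3))
```
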

Algorithm E begins its attempt to factorize $N$ by essentially replacing $N$
by $kN$, and this is a rather curious way to proceed (if not downright stupid).
Nevertheless, it turns out to be a good idea, since certain values of $k$ will make
the $V$ numbers potentially divisible by more small primes, hence they will be
more likely to factor completely in step E3. On the other hand, a large value
of $k$ will make the $V$ numbers larger, hence they will be less likely to factor
completely; we want to balance these tendencies by choosing $k$ wisely. Consider,
for example, the divisibility of $V$ by powers of 5. We have $P^2 - kNQ^2 =
(-1)^S V$ in step E3, so if 5 divides $V$ we have $P^2 \equiv kNQ^2$ (modulo 5). In this
congruence $Q$ cannot be a multiple of 5, since it is relatively prime to $P$, so we
may write $(P/Q)^2 \equiv kN$ (modulo 5). If we assume that $P$ and $Q$ are random
relatively prime integers, so that the 24 possible pairs $(P \bmod 5, Q \bmod 5) \ne (0, 0)$
are equally likely, the probability that 5 divides $V$ is therefore $\frac{4}{24}$, $\frac{8}{24}$, 0, 0, or
$\frac{8}{24}$ according as $kN \bmod 5$ is 0, 1, 2, 3, or 4. Similarly the probability that 25
divides $V$ is 0, $\frac{40}{600}$, 0, 0, $\frac{40}{600}$ respectively, unless $kN$ is a multiple of 25. In
general, given an odd prime $p$ with $(kN)^{(p-1)/2} \bmod p = 1$, we find that $V$ is a
multiple of $p^e$ with probability $2/(p^{e-1}(p+1))$; and the average number of times
$p$ divides $V$ comes to $2p/(p^2 - 1)$. This analysis, due to R. Schroeppel,
suggests that the best choice of $k$ is the value that maximizes

    $$\sum_{p\ \text{prime}} f(p, kN) \log p - \tfrac12 \log k, \tag{22}$$

where $f$ is the function defined in exercise 28 and the sum is over all primes less
than or equal to $p_m$, for this is essentially the expected value of the logarithm
of $\sqrt{N}/T$ when we reach step E4.
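
The probabilities in this model can be confirmed by brute force. The sketch below enumerates all residue pairs $(P \bmod p^e, Q \bmod p^e)$ that are not both divisible by $p$ (24 pairs for $p^e = 5$, 600 pairs for $p^e = 25$) and counts those with $P^2 \equiv kNQ^2$ (modulo $p^e$). The function name `prob_divides` is an assumption of this illustration, not notation from the text.

```python
from fractions import Fraction

def prob_divides(p, e, kN):
    """Probability that p**e divides V, under the random-pair model."""
    q = p ** e
    # All residue pairs modulo p**e that are not both multiples of p.
    pairs = [(P, Q) for P in range(q) for Q in range(q) if P % p or Q % p]
    hits = sum((P * P - kN * Q * Q) % q == 0 for P, Q in pairs)
    return Fraction(hits, len(pairs))

print([prob_divides(5, 1, r) for r in range(5)])   # kN mod 5 = 0, 1, 2, 3, 4
print([prob_divides(5, 2, r) for r in (5, 1, 2, 3, 4)])
```

For a residue $kN$ that is a nonzero square modulo $p$, the counts agree with the general formula $2/(p^{e-1}(p+1))$.
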

Best results will be obtained with Algorithm E when both $k$ and $m$ are well
chosen. The proper choice of $m$ can only be made by experimental testing, since
the asymptotic analysis we have made is too crude to give sufficiently precise
information, and since a variety of refinements to the algorithm tend to have
unpredictable effects. For example, we can make an important improvement by
comparing step E3 with Algorithm A: The factoring of $V$ can stop whenever we
find $T \bmod p_j \ne 0$ and $\lfloor T/p_j \rfloor \le p_j$, since $T$ will then be either 1 or prime. If $T$
is a prime greater than $p_m$ (it will be at most $p_m^2 + p_m - 1$ in such a case), we can
still output $(P, e_0, \ldots, e_m, T)$, since a complete factorization has been obtained.
The second phase of the algorithm will use only those outputs whose prime
$T$'s have occurred at least twice. This modification gives the effect of a much
longer list of primes, without increasing the factorization time. Wunderlich's
experiments indicate that $m \approx 150$ works well in the presence of this refinement,
when $N$ is in the neighborhood of $10^{40}$.
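
A minimal sketch of step E3 with this early stopping rule is shown here. It assumes that `primes` contains every prime up to $p_m$ that can divide $V$ (which is what makes the leftover $T$ either 1 or prime); the function name `cast_out` and the returned `complete` flag are illustrative choices, not part of Algorithm E as stated.

```python
def cast_out(V, primes):
    """Return (exponents, T, complete) for V, stopping early when possible."""
    index = {p: j for j, p in enumerate(primes)}
    exps = [0] * len(primes)
    T = V
    for j, p in enumerate(primes):
        while T % p == 0:
            T //= p
            exps[j] += 1
        if T // p <= p:
            # Stopping test borrowed from Algorithm A: T is now 1 or prime.
            if T in index:
                # The leftover prime is itself one of the later p_i.
                exps[index[T]] += 1
                T = 1
            return exps, T, True
    # All primes were tried and T is still large: no complete factorization.
    return exps, T, False

print(cast_out(720, [2, 3, 5]))   # smooth value
print(cast_out(145, [2, 3, 5]))   # leftover prime 29 = 5*5 + 5 - 1
print(cast_out(477, [2, 3, 5]))   # leftover 53 is too large to certify
```
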

Since step E3 is by far the most time-consuming part of the algorithm,
Morrison, Brillhart, and Schroeppel have suggested several ways to abort this step
when success becomes improbable: (a) Whenever $T$ changes to a single-precision
value, continue only if $\lfloor T/p_j \rfloor > p_j$ and $3^{T-1} \bmod T \ne 1$. (b) Give up if $T$ is still
$> p_m^2$ after casting out factors $< \frac{1}{10} p_m$. (c) Cast out factors only up to $p_5$, say,
for batches of 100 or so consecutive $V$'s; continue the factorization later, but only
on the $V$ from each batch that has produced the smallest residual $T$. (Before
casting out the factors up to $p_5$, it is wise to calculate $N \bmod p_1^{f_1} p_2^{f_2} p_3^{f_3} p_4^{f_4} p_5^{f_5}$,
where the $f$'s are small enough to make $p_1^{f_1} p_2^{f_2} p_3^{f_3} p_4^{f_4} p_5^{f_5}$ fit in single precision, but
large enough to make $N \bmod p_i^{f_i+1} = 0$ unlikely. One single-precision remainder
will therefore characterize the value of $N$ modulo five small primes.)

For estimates of the cycle length in the output of Algorithm E, see D. R. 
Hickerson, Pacific J. Math. 46 (1973), 429-432; D. Shanks, Proc. Boulder Number 
Theory Conference (Univ. of Colorado: 1972), 217-224. 
Other approaches. A completely different method of factorization, based on
composition of binary quadratic forms, has been introduced by Daniel Shanks
[Proc. Symp. Pure Math. 20 (1971), 415-440]. Like Algorithm B, it will factor a
given $N$ in $O(N^{(1/4)+\epsilon})$ steps except under wildly improbable circumstances.

Still another important technique has been suggested by John M. Pollard
[Proc. Cambridge Phil. Soc. 76 (1974), 521-528]. He shows in essence that each
prime factor $\le m$ can be found in $O(\sqrt{m}\,(\log mN)^4)$ steps, with a probabilistic
algorithm that uses a random $x$ as in Algorithm P and performs suitable
convolutions. In this paper Pollard also gives a practical algorithm for discovering
prime factors $p$ of $N$ when $p - 1$ has no large prime factors. The latter
algorithm (see exercise 19) is probably the first thing to try after Algorithms A
and B have run too long on a large $N$.

A survey paper by R. K. Guy, written in collaboration with J. H. Conway, 
Proc. Fifth Manitoba Conf. Numer. Math. (1975), 49-89, gives a unique perspective
on these developments.

*A theoretical upper bound. From the standpoint of computational complexity, 
we would like to know if there is any method of factorization whose expected 
running time can be proved to be $O(N^{\epsilon(N)})$, where $\epsilon(N) \to 0$ as $N \to \infty$. We
showed that Algorithm E probably has such behavior, but it seems hopeless 
to find a rigorous proof, because continued fractions are not sufficiently well 
disciplined. The first proof that a good factorization algorithm exists in this 
sense was discovered by John Dixon in 1978; Dixon showed, in fact, that it 
suffices to consider a simplified version of Algorithm E, in which the continued 
fraction apparatus is removed but the basic idea of (17) remains. 

Dixon's method is simply this, assuming that $N$ is known to have at least
two distinct prime factors, and that $N$ is not divisible by the first $m$ primes
$p_1, p_2, \ldots, p_m$: Choose a random integer $X$ in the range $0 < X < N$, and
let $V = X^2 \bmod N$. If $V = 0$, the number $\gcd(X, N)$ is a proper factor of $N$.
Otherwise cast out all of the small prime factors of $V$ as in step E3; in other
words, express $V$ in the form

    $$V = p_1^{e_1} \ldots p_m^{e_m} T,$$

where $T$ is not divisible by any of the first $m$ primes. If $T = 1$, the algorithm
proceeds as in step E4 to output $(X, e_1, \ldots, e_m)$, which represents a solution to
(18) with $e_0 = 0$. This process continues with new random values of $X$ until
there are sufficiently many outputs to discover a factor of $N$ by the method of
exercise 12.
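
The following Python sketch shows one way the whole procedure might look. The method of exercise 12 is not given in this section, so the sketch substitutes a simple incremental elimination over the exponent parities (mod 2), kept as bit masks; that substitution, the seeded random generator, the early `gcd` check, and the names `dixon`, `seed` and `max_tries` are all assumptions of this illustration.

```python
import random
from math import gcd

def dixon(N, primes, seed=1, max_tries=200_000):
    """Try to return a proper factor of N, or None if none was found."""
    rng = random.Random(seed)
    basis = {}   # leading parity bit -> (parity mask, x, exponent list)
    for _ in range(max_tries):
        X = rng.randrange(1, N)
        g = gcd(X, N)
        if g > 1:
            return g                       # lucky hit: X shares a factor with N
        T, exps = X * X % N, [0] * len(primes)
        for j, p in enumerate(primes):     # cast out small primes, as in step E3
            while T % p == 0:
                T //= p
                exps[j] += 1
        if T != 1:
            continue                       # V did not factor completely
        x = X
        mask = sum((e & 1) << j for j, e in enumerate(exps))
        # Combine with earlier outputs until the parity vector is reduced.
        while mask and (mask.bit_length() - 1) in basis:
            bmask, bx, bexps = basis[mask.bit_length() - 1]
            mask ^= bmask
            x = x * bx % N
            exps = [a + b for a, b in zip(exps, bexps)]
        if mask:
            basis[mask.bit_length() - 1] = (mask, x, exps)
            continue
        # All exponents are even: x^2 = y^2 (modulo N).
        y = 1
        for p, e in zip(primes, exps):
            y = y * pow(p, e // 2, N) % N
        g = gcd(x - y, N)
        if 1 < g < N:
            return g
    return None

print(dixon(197209, [2, 3, 5, 7, 11, 13]))
```

Each combination with even exponents yields a congruence $x^2 \equiv y^2$; roughly half of these are trivial ($x \equiv \pm y$), in which case the loop simply continues with new random values.
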

In order to analyze this algorithm, we want to find bounds on (a) the 
probability that a random X will yield an output, and (b) the probability that a 
large number of outputs will be required before a factor is found. Let P(m, N) be 
the probability (a), i.e., the probability that T = 1 when X is chosen at random. 
After M values of X have been tried, we will obtain MP(m, N) outputs, on the 
average; and the number of outputs has a binomial distribution, so the standard 
deviation is less than the square root of the mean. The probability (b) is fairly 
easy to deal with, since exercise 13 proves that the algorithm needs more than
$m + k$ outputs with probability $\le 2^{-k}$.

Exercise 30 proves that $P(m, N) \ge m^r/(r!\,N)$ when $r = 2\lfloor \log N/(2 \log p_m) \rfloor$,
so we can estimate the running time as we did in (21) but with the quantity $2\sqrt{N}$
replaced by $N$. This time we choose $\ln m = \sqrt{(\ln N \ln\ln N)/2}$, so that

    $$r = \sqrt{\frac{2 \ln N}{\ln\ln N}} - 1 + O\!\left(\frac{\log\log\log N}{\log\log N}\right),$$

    $$\frac{m^r}{r!\,N} = \exp\bigl(-\sqrt{2 \ln N \ln\ln N} + O(r \log\log\log N)\bigr).$$

We will choose $M$ so that $M m^r/(r!\,N) \ge 4m$; thus the expected number of
outputs $M P(m, N)$ will be at least $4m$. The running time of the algorithm is
of order $M m \log N$, plus $O(m^3)$ steps for exercise 12; it turns out that $O(m^3)$ is
less than $M m \log N$, which is

    $$\exp\bigl(\sqrt{8 (\ln N)(\ln\ln N)} + O((\log N)^{1/2} (\log\log N)^{-1/2} (\log\log\log N))\bigr).$$

The probability that this method fails to find a factor is negligibly small, since
the probability is at most $e^{-m/2}$ that fewer than $2m$ outputs are obtained (see
exercise 31), while the probability is at most $2^{-m}$ that no factors are found
from the first $2m$ outputs, and $m \gg \ln N$. We have proved the following slight
strengthening of Dixon's original theorem:

Theorem D. There is an algorithm whose running time is $O(N^{\epsilon(N)})$, where
$\epsilon(N) = c\sqrt{\ln\ln N / \ln N}$ and $c$ is any constant greater than $\sqrt{8}$, that finds a
nontrivial factor of $N$ with probability $1 - O(1/N)$, whenever $N$ has at least
two distinct prime divisors. ∎

Secret factors. Worldwide interest in the problem of factorization increased 
dramatically in 1977, when R. L. Rivest, A. Shamir, and L. Adleman discovered 
a way to encode messages that can apparently be decoded only by knowing the 
factors of a large number N, even though the method of encoding is known 
to everyone. Since a significant number of the world’s greatest mathematicians 
have been unable to find efficient methods of factoring, this scheme [CACM 21
(1978), 120-126] almost certainly provides a secure way to protect confidential
data and communications in computer networks. 

Let us imagine a small electronic device called an RSA box that has two 
large prime numbers $p$ and $q$ stored in its memory. We will assume that $p - 1$ and
$q - 1$ are not divisible by 3. The RSA box is connected somehow to a computer,
and it has told the computer the product $N = pq$; however, no human being
will be able to discover the values of p and q except by factoring N, since the 
RSA box is cleverly designed to self-destruct if anybody tries to tamper with it. 
In other words, it will erase its memory if it is jostled or if it is subjected to any 
radiation that could change or read out the data stored inside. Furthermore, the 
RSA box is sufficiently reliable that it never needs to be maintained; we simply 
would discard it and buy another, if an emergency arose or if it wore out. The
prime factors p and q were generated by the RSA box itself, using some scheme 
based on truly random phenomena in nature like cosmic rays. The important 
point is that nobody knows p or q, not even a person or organization that owns 
or has access to this RSA box; there is no point in bribing or blackmailing anyone 
or holding anybody hostage in order to discover N’s factors. 

To send a secret message to the owner of an RSA box whose product number
is $N$, you break the message up into a sequence of numbers $(x_1, \ldots, x_k)$, where
each $x_i$ is between 0 and $N$, then you transmit the numbers

    $$(x_1^3 \bmod N, \ldots, x_k^3 \bmod N).$$

The RSA box, knowing $p$ and $q$, can decode the message, because it has precomputed
a number $d < N$ such that $3d \equiv 1$ (modulo $(p - 1)(q - 1)$); it can now
compute $(x^3)^d \bmod N = x$ in a reasonable amount of time, using the method
of Section 4.6.3. Naturally the RSA box keeps this magic number $d$ to itself; in
fact, the RSA box might choose to remember only $d$ instead of $p$ and $q$, because
its only duties after having computed $N$ are to protect its secrets and to take
cube roots mod $N$.
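
A toy version of the RSA box can be written in a few lines of Python. The primes below are far too small to be secure and are chosen only so that the arithmetic is easy to follow; `make_rsa_box` is a hypothetical helper name, and the modular inverse uses the built-in three-argument `pow` with exponent $-1$ (available in Python 3.8 and later).

```python
def make_rsa_box(p, q):
    """Return (N, encode, decode) for cube-based RSA with primes p and q."""
    assert (p - 1) % 3 != 0 and (q - 1) % 3 != 0
    N = p * q
    d = pow(3, -1, (p - 1) * (q - 1))      # 3d = 1 (modulo (p-1)(q-1))
    encode = lambda x: pow(x, 3, N)        # public: x -> x^3 mod N
    decode = lambda c: pow(c, d, N)        # private: cube root mod N
    return N, encode, decode

N, encode, decode = make_rsa_box(11, 17)   # N = 187, d = 107
message = [42, 7, 100]
cipher = [encode(x) for x in message]
print(cipher, [decode(c) for c in cipher])
```
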

If $x < \sqrt[3]{N}$, the above encoding scheme is ineffective, since $x^3 \bmod N = x^3$
and the cube root will easily be found. The logarithmic law of leading digits in
Section 4.2.4 implies that the leading place $x_1$ of a $k$-place message $(x_1, \ldots, x_k)$
will be less than $\sqrt[3]{N}$ about $\frac13$ of the time, so this is a problem that needs to be
resolved. Exercise 32 presents one way to do this.
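
To see how easily such a short message is recovered, here is a sketch of an exact integer cube root by bisection. The helper name `integer_cube_root` and the modulus $1013 \cdot 1019$ are choices made for this example only.

```python
def integer_cube_root(n):
    """Largest integer r with r**3 <= n, found by bisection."""
    lo, hi = 0, 1
    while hi ** 3 <= n:
        hi *= 2
    # Invariant: lo**3 <= n < hi**3.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid
    return lo

N = 1013 * 1019            # 1032247; its cube root is about 101
x = 100                    # a message below the cube root of N
c = pow(x, 3, N)           # no reduction happens: c equals x**3
print(c, integer_cube_root(c))
```
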

The security of the RSA encoding scheme relies on the fact that nobody 
has been able to discover how to take cube roots mod $N$ without knowing $N$'s
factors. It seems likely that no such method will be found, but we cannot be 
absolutely sure. So far all that can be said for certain is that all of the ordinary 
ways to discover cube roots will fail. For example, there is essentially no point 
in trying to compute the number d as a function of N; the reason is that if 
d is known, or in fact if any number m of reasonable size is known such that 
$x^m \bmod N = 1$ holds for a significant number of $x$'s, then we can find the factors
of N in a few more steps (see exercise 34). Thus, any method of attack based 
explicitly or implicitly on finding such an m can be no better than factoring. 

The numbers p and q shouldn’t merely be “random” primes in order to 
make the RSA scheme effective. We have mentioned that p — 1 and q — 1 
should not be divisible by 3, since we want to ensure that unique cube roots 
exist modulo N. Another condition is that p — 1 should have at least one very 
large prime factor, and so should q — 1; otherwise N can be factored using the 
algorithm of exercise 19. In fact, that algorithm essentially relies on finding a 
fairly small number $m$ with the property that $x^m \bmod N$ is frequently equal to 1,
and we have just seen that such an $m$ is dangerous. When $p - 1$ and $q - 1$
have large prime factors $p_1$ and $q_1$, the theory in exercise 34 implies that $m$ is
either a multiple of $p_1 q_1$ (hence $m$ will be hard to discover) or the probability
that $x^m \equiv 1$ will be less than $1/p_1 q_1$ (hence $x^m \bmod N$ will almost never be 1).
Besides this condition, we don't want $p$ and $q$ to be close to each other, lest
Algorithm D succeed in discovering them; in fact, we don’t want the ratio p/q 
to be near a simple fraction, otherwise Lehman’s generalization of Algorithm C 
could find them. 

The following procedure for generating $p$ and $q$ is almost surely unbreakable:
Start with a truly random number $p_0$ between, say, $10^{80}$ and $10^{81}$. Search for the
first prime number $p_1$ greater than $p_0$; this will require testing about $\frac12 \ln p_0 \approx
90$ odd numbers, and it will be sufficient to have $p_1$ a "probable prime" with
probability $> 1 - 2^{-100}$ after 50 trials of Algorithm P. Then choose another
truly random number $p_2$ between, say, $10^{39}$ and $10^{40}$. Search for the first prime
number $p$ of the form $kp_1 + 1$ where $k \ge p_2$, $k$ is even, and $k \equiv p_1$ (modulo 3).
This will require testing about $\frac16 \ln p_1 p_2 \approx 45$ numbers before a prime $p$ is found.
The prime $p$ will be about 120 digits long; a similar construction can be used to
find a prime $q$ about 130 digits long. For extra security, it is probably advisable
to check that neither $p + 1$ nor $q + 1$ consists entirely of rather small prime
factors (see exercise 20). The product $N = pq$, whose order of magnitude will
be about $10^{250}$, now meets all of our requirements, and it is inconceivable at this
time that such an $N$ could be factored.
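
The construction can be sketched in Python as follows. Algorithm P is not reproduced here, so a standard strong probable-prime (Miller-Rabin style) test stands in for it; that substitution, the pseudorandom generator (the text calls for truly random numbers), and the names `is_probable_prime` and `generate_rsa_prime` are assumptions of this sketch. The example call uses much smaller ranges than the text so that it runs instantly.

```python
import random

def is_probable_prime(n, trials=50, rng=random):
    """Strong probable-prime test with random bases (stand-in for Algorithm P)."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13):
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(trials):
        x = pow(rng.randrange(2, n - 1), d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False                   # a witness to compositeness
    return True

def generate_rsa_prime(lo1, hi1, lo2, hi2, rng):
    """Return (p, p1) with p = k*p1 + 1, k even, k >= p2, k = p1 (modulo 3)."""
    p1 = rng.randrange(lo1, hi1) + 1       # start just above p0
    while not is_probable_prime(p1, rng=rng):
        p1 += 1
    k = rng.randrange(lo2, hi2)            # start at p2
    while k % 2 != 0 or (k - p1) % 3 != 0:
        k += 1
    while not is_probable_prime(k * p1 + 1, rng=rng):
        k += 6                             # keeps k even and k = p1 (modulo 3)
    return k * p1 + 1, p1

rng = random.Random(2024)
p, p1 = generate_rsa_prime(10**8, 10**9, 10**3, 10**4, rng)
print(p, p1, (p - 1) % 3)
```

The congruence $k \equiv p_1$ (modulo 3) makes $p = kp_1 + 1 \equiv p_1^2 + 1 \equiv 2$, so $p - 1$ is automatically not divisible by 3, while $p_1$ supplies the large prime factor of $p - 1$.
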

For example, suppose we knew a method that could factor a 250-digit
number $N$ in $N^{0.1}$ microseconds. This amounts to $10^{25}$ microseconds, and there
are only 31,556,952,000,000 μs per year, so we would need more than $3 \times 10^{11}$
years of CPU time to complete the factorization. Even if a government agency
purchased 10 billion computers and set them all to working on this problem, it
would take more than 31 years before one of them would crack $N$ into factors;
meanwhile the fact that the government had purchased so many specialized
machines would leak out, and people would start using 300-digit $N$'s.
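
The arithmetic behind these figures is easy to reproduce; the variable names below are of course only illustrative.

```python
US_PER_YEAR = 31_556_952_000_000       # microseconds per year, as quoted
microseconds = 10 ** 25                # N**0.1 when N is about 10**250
years_one_cpu = microseconds / US_PER_YEAR
years_fleet = years_one_cpu / 10 ** 10 # ten billion machines sharing the work
print(f"{years_one_cpu:.3e} years on one machine, {years_fleet:.1f} years on the fleet")
```
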

Since the encoding method $x \mapsto x^3 \bmod N$ is known to everyone, there are
additional advantages besides the fact that the code can be cracked only by the
RSA box. Such "public key" systems were first considered by W. Diffie and
M. E. Hellman in IEEE Trans. IT-22 (1976), 644-651. As an example of what
can be done when the encoding method is public knowledge, suppose that Alice
wants to communicate with Bill via electronic mail, and suppose each of them
wants the letters to be signed so that the receiver can be sure that nobody else
is forging any messages. Let $E_A(M)$ be the encoding function for messages $M$
sent to Alice, let $D_A(M)$ be the decoding done by Alice's RSA box, and let
$E_B(M)$, $D_B(M)$ be the corresponding encoding and decoding functions for Bill's
RSA box. Then Alice can send a signed message by affixing her name and the
date to some confidential message, then transmitting $E_B(D_A(M))$ to Bill, using
her machine to compute $D_A(M)$. When Bill gets this message, his RSA box
converts it to $D_A(M)$, and he knows $E_A$ so he can compute $M = E_A(D_A(M))$.
This should convince him that the message did indeed come from Alice; nobody
else could have sent the message $D_A(M)$.
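
The signing protocol can be tried out with two toy RSA boxes. The primes and the helper name `make_box` are assumptions of this sketch; in particular the example takes $N_A < N_B$ so that $D_A(M)$ is a valid input to $E_B$, a detail the discussion above does not address.

```python
def make_box(p, q):
    """Return (N, E, D): public cubing map E and private cube-root map D."""
    N = p * q
    d = pow(3, -1, (p - 1) * (q - 1))
    return N, (lambda x: pow(x, 3, N)), (lambda x: pow(x, d, N))

N_A, E_A, D_A = make_box(11, 17)   # Alice: N_A = 187
N_B, E_B, D_B = make_box(23, 29)   # Bill:  N_B = 667, larger than N_A

M = 42                             # the confidential message, M < N_A
sent = E_B(D_A(M))                 # Alice signs with D_A, then encodes for Bill
received = E_A(D_B(sent))          # Bill decodes, then applies Alice's public E_A
print(sent, received)

forged = E_A(D_B(E_B(M)))          # an unsigned message fails the check
print(forged)
```
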

We might ask, how do Alice and Bill know each other's encoding functions
$E_A$ and $E_B$? It wouldn't do simply to have them stored in a public file, since some
Charlie could tamper with that file, substituting an $N$ that he has computed
by himself; Charlie could then surreptitiously intercept and decode a private
message before Alice or Bill would discover that something is amiss. The solution
is to keep the product numbers $N_A$ and $N_B$ in a special public directory that
has its own RSA box and its own widely publicized product number $N_D$. When
Alice wants to know how to communicate with Bill, she asks the directory for
Bill's product number; the directory computer sends her a signed message giving
the value of $N_B$. Nobody can forge such a message, so it must be legitimate.

An interesting alternative to the RSA scheme has been proposed by Michael
Rabin [MIT Lab. for Comp. Sci., report TR-212 (1979)], who suggests encoding
by the function $x^2 \bmod N$ instead of $x^3 \bmod N$. In this case the decoding
mechanism, which we can call a SQRT box, returns four different messages; the
reason is that four different numbers have the same square modulo $N$, namely
$x$, $-x$, $fx \bmod N$, and $(-fx) \bmod N$, where $f = (p^{q-1} - q^{p-1}) \bmod N$. If we
agree in advance that $x$ is even, or that $x < \frac12 N$, then the ambiguity drops to
two messages, presumably only one of which makes any sense. The ambiguity
can in fact be eliminated entirely, as shown in exercise 35. Rabin's scheme has
the important property that it is provably as difficult to find square roots mod $N$
as to find the factorization $N = pq$; for by taking the square root of $x^2 \bmod N$
when $x$ is chosen at random, we have probability $\frac12$ of finding a value $y$ such that
$x^2 \equiv y^2$ and $x \not\equiv \pm y$, after which $\gcd(x - y, N) = p$ or $q$. However, the system has a
fatal flaw that does not seem to be present in the RSA scheme (see exercise 33):
Anyone with access to a SQRT box can easily determine the factors of its $N$.
This not only permits cheating by dishonest employees, or threats of extortion,
it also allows people to reveal their $p$ and $q$, after which they might claim that
their "signature" on some transmitted document was a forgery. Thus it is clear
that the goal of secure communication leads to subtle problems quite different
from those we usually face in the design and analysis of algorithms.
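
A small numerical illustration of the four square roots, and of how two essentially different roots expose a factor, is given here. The primes 7 and 11 and the function name `four_square_roots` are chosen only for this sketch.

```python
from math import gcd

def four_square_roots(x, p, q):
    """The four numbers x, -x, fx, -fx (mod N) that share the square x^2 mod N."""
    N = p * q
    f = (pow(p, q - 1, N) - pow(q, p - 1, N)) % N   # f = -1 mod p, +1 mod q
    return sorted({x % N, -x % N, f * x % N, -f * x % N})

p, q = 7, 11
N = p * q
x = 3
roots = four_square_roots(x, p, q)
print(roots, [r * r % N for r in roots])

# A SQRT box that answers with a root y different from x and -x gives N away.
y = next(r for r in roots if r not in (x, N - x))
print(y, gcd(x - y, N))
```
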

The largest known primes. We have discussed several computational methods 
elsewhere in this book that require the use of large prime numbers, and the 
techniques just described can be used to discover primes having up to, say,
25 digits, with relative ease. Table 1 shows the ten largest primes that are less
than the word size of typical computers. (Some other useful primes appear in 
the answer to exercise 4.6.4-57.) 

Actually much larger primes of special forms are known, and it is occasionally
important to find primes that are as large as possible. Let us therefore conclude
this section by investigating the interesting manner in which the largest
explicitly known primes have been discovered. Such primes are of the form
$2^n - 1$, for various special values of $n$, and so they are especially suited to certain
applications of binary computers.

A number of the form $2^n - 1$ cannot be prime unless $n$ is prime, since $2^{uv} - 1$
is divisible by $2^u - 1$. In 1644, Marin Mersenne astonished his contemporaries by
stating, in essence, that the numbers $2^p - 1$ are prime for $p$ = 2, 3, 5, 7, 13, 17,
19, 31, 67, 127, 257, and for no other $p$ less than 257. (This statement appeared
in connection with a discussion of perfect numbers in the preface to his Cogitata
Table 1
USEFUL PRIME NUMBERS

N         a_1   a_2   a_3   a_4   a_5   a_6   a_7   a_8   a_9  a_10
2^15       19    49    51    55    61    75    81   115   121   135
2^16       15    17    39    57    87    89    99   113   117   123
2^17        1     9    13    31    49    61    63    85    91    99
2^18        5    11    17    23    33    35    41    65    75    93
2^19        1    19    27    31    45    57    67    69    85    87
2^20        3     5    17    27    59    69   129   143   153   185
2^21        9    19    21    55    61    69   105   111   121   129
2^22        3    17    27    33    57    87   105   113   117   123
2^23       15    21    27    37    61    69   135   147   157   159
2^24        3    17    33    63    75    77    89    95   117   167
2^25       39    49    61    85    91   115   141   159   165   183
2^26        5    27    45    87   101   107   111   117   125   135
2^27       39    79   111   115   135   187   199   219   231   235
2^28       57    89    95   119   125   143   165   183   213   273
2^29        3    33    43    63    73    75    93    99   121   133
2^30       35    41    83   101   105   107   135   153   161   173
2^31        1    19    61    69    85    99   105   151   159   171
2^32        5    17    65    99   107   135   153   185   209   267
2^33        9    25    49    79   105   285   301   303   321   355
2^34       41    77   113   131   143   165   185   207   227   281
2^35       31    49    61    69    79   121   141   247   309   325
2^36        5    17    23    65   117   137   159   173   189   233
2^37       25    31    45    69   123   141   199   201   351   375
2^38       45    87   107   131   153   185   191   227   231   257
2^39        7    19    67    91   135   165   219   231   241   301
2^40       87   167   195   203   213   285   293   299   389   437
2^41       21    31    55    63    73    75    91   111   133   139
2^42       11    17    33    53    65   143   161   165   215   227
2^43       57    67   117   175   255   267   291   309   319   369
2^44       17   117   119   129   143   149   287   327   359   377
2^45       55    69    81    93   121   133   139   159   193   229
2^46       21    57    63    77   167   197   237   287   305   311
2^47      115   127   147   279   297   339   435   541   619   649
2^48       59    65    89    93   147   165   189   233   243   257
2^59       55    99   225   427   517   607   649   687   861   871
2^60       93   107   173   179   257   279   369   395   399   453
2^63       25   165   259   301   375   387   391   409   457   471
2^64       59    83    95   179   189   257   279   323   353   363
10^6       17    21    39    41    47    69    83    93   117   137
10^7        9    27    29    57    63    69    71    93    99   111
10^8       11    29    41    59    69   153   161   173   179   213
10^9       63    71   107   117   203   239   243   249   261   267
10^10      33    57    71   119   149   167   183   213   219   231
10^11      23    53    57    93   129   149   167   171   179   231
10^12      11    39    41    63   101   123   137   143   153   233
10^16      63    83   113   149   183   191   329   357   359   369

The ten largest primes less than $N$ are $N - a_1$, ..., $N - a_{10}$.
Physico-Mathematica. Curiously, he also made the following remark: "To tell
if a given number of 15 or 20 digits is prime or not, all time would not suffice
for the test, whatever use is made of what is already known.") Mersenne, who
had corresponded frequently with Fermat, Descartes, and others about similar
topics in previous years, gave no proof of his assertions, and for over 200 years
nobody knew whether he was correct or not. Euler showed that $2^{31} - 1$ is
prime in 1772, after having tried unsuccessfully to prove this in previous years.
About 100 years later, E. Lucas discovered that $2^{127} - 1$ is prime, but $2^{67} - 1$
is not; therefore Mersenne was not completely accurate. Then I. M. Pervushin
proved in 1883 that $2^{61} - 1$ is prime [cf. Istoriko-Mat. Issledovaniia 6 (1953),
559], and this touched off speculation that Mersenne had only made a copying
error, writing 67 for 61. Eventually other errors in Mersenne's statement were
discovered; R. E. Powers [AMM 18 (1911), 195] found that $2^{89} - 1$ is prime, as
had been conjectured by some earlier writers, and three years later he proved
that $2^{107} - 1$ also is prime. M. Kraitchik showed in 1922 that $2^{257} - 1$ is not
prime [cf. Recherches sur la Théorie des Nombres (Paris: 1924), 21].

At any rate, numbers of the form $2^p - 1$ are now called Mersenne numbers,
and it is known that the first 27 Mersenne primes are obtained for $p$ equal to

    2, 3, 5, 7, 13, 17, 19, 31, 61, 89, 107, 127, 521, 607, 1279, 2203, 2281,
    3217, 4253, 4423, 9689, 9941, 11213, 19937, 21701, 23209, 44497.        (23)

The 24th of these was found by Bryant Tuckerman [Proc. Nat. Acad. Sci. 68
(1971), 2319-2320], and the 25th was found in 1978 by Laura Nickel and Curt
Noll. Shortly afterwards, Noll found the 26th, and David Slowinski harnessed a
CRAY-1 computer to the task of discovering the 27th; see J. Recreational Math.
11 (1979), 258-261. Note that the prime $8191 = 2^{13} - 1$ does not occur in
(23); Mersenne had stated that $2^{8191} - 1$ is prime, and others had conjectured
that any Mersenne prime could perhaps be used in the exponent.

Since $2^{44497} - 1$ is a 13,395-digit number, it is clear that some special
techniques have been used to prove that it is prime. An efficient way to test
the primality of a given Mersenne number $2^p - 1$ was first devised by E. Lucas
[Amer. J. Math. 1 (1878), 184-239, 289-321, especially p. 316] and improved
by D. H. Lehmer [Annals of Math. 31 (1930), 419-448, especially p. 443]. The
Lucas-Lehmer test, which is a special case of the method now used for testing
the primality of $n$ when the factors of $n + 1$ are known, is the following:

Theorem L. Let $q$ be an odd prime, and define the sequence $\langle L_n \rangle$ by the rule

    $$L_0 = 4, \qquad L_{n+1} = (L_n^2 - 2) \bmod (2^q - 1). \tag{24}$$

Then $2^q - 1$ is prime if and only if $L_{q-2} = 0$.

For example, $2^3 - 1$ is prime since $L_1 = (4^2 - 2) \bmod 7 = 0$. This test is
particularly well suited to binary computers, using multiple-precision arithmetic
when $q$ is large, since calculation mod $(2^q - 1)$ is so convenient; cf. Section 4.3.2.
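
Theorem L translates directly into a short program. The sketch below relies on Python's built-in big integers and the ordinary `%` operator rather than the special mod $(2^q - 1)$ arithmetic mentioned above; the names `lucas_lehmer` and `is_small_prime` are illustrative.

```python
def lucas_lehmer(q):
    """Theorem L: for an odd prime q, decide whether 2**q - 1 is prime."""
    n = (1 << q) - 1
    L = 4
    for _ in range(q - 2):
        L = (L * L - 2) % n
    return L == 0

def is_small_prime(q):
    """Trial division, adequate for small exponents q."""
    return q >= 2 and all(q % d for d in range(2, int(q ** 0.5) + 1))

exponents = [q for q in range(3, 130) if is_small_prime(q) and lucas_lehmer(q)]
print(exponents)
```

The odd exponents found this way below 130 agree with the beginning of list (23).
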
Proof. We will prove Theorem L using only very simple principles of number
theory, by investigating several features of recurring sequences that are of
independent interest. Consider the sequences $\langle U_n \rangle$ and $\langle V_n \rangle$ defined by

    $$U_0 = 0, \quad U_1 = 1, \quad U_{n+1} = 4U_n - U_{n-1};$$
    $$V_0 = 2, \quad V_1 = 4, \quad V_{n+1} = 4V_n - V_{n-1}. \tag{25}$$

The following equations are readily proved by induction:

    $$V_n = U_{n+1} - U_{n-1}; \tag{26}$$
    $$U_n = \bigl((2 + \sqrt3)^n - (2 - \sqrt3)^n\bigr)/\sqrt{12}; \tag{27}$$
    $$V_n = (2 + \sqrt3)^n + (2 - \sqrt3)^n; \tag{28}$$
    $$U_{m+n} = U_m U_{n+1} - U_{m-1} U_n. \tag{29}$$
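
Identities (26) through (29) are easy to spot-check numerically before proving them. In this sketch the helper name `lucas_uv` is made up, (26) and (29) are tested exactly with integers, and (27) and (28) are compared after rounding floating-point values, which is reliable only for small $n$.

```python
from math import sqrt

def lucas_uv(limit):
    """Lists U[0..limit] and V[0..limit] from recurrence (25)."""
    U, V = [0, 1], [2, 4]
    for n in range(1, limit):
        U.append(4 * U[n] - U[n - 1])
        V.append(4 * V[n] - V[n - 1])
    return U, V

U, V = lucas_uv(40)
ok26 = all(V[n] == U[n + 1] - U[n - 1] for n in range(1, 40))
ok29 = all(U[m + n] == U[m] * U[n + 1] - U[m - 1] * U[n]
           for m in range(1, 20) for n in range(20))
a, b = 2 + sqrt(3), 2 - sqrt(3)
ok27 = all(round((a ** n - b ** n) / sqrt(12)) == U[n] for n in range(16))
ok28 = all(round(a ** n + b ** n) == V[n] for n in range(16))
print(U[:6], V[:5], ok26, ok27, ok28, ok29)
```
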


Let us now prove an auxiliary result, when $p$ is prime and $e \ge 1$:

    if $U_n \equiv 0$ (modulo $p^e$) then $U_{np} \equiv 0$ (modulo $p^{e+1}$).        (30)

This follows from the more general considerations of exercise 3.2.2-11, but a
direct proof for sequence (25) can be given. Assume that $U_n = bp^e$, $U_{n+1} = a$.
By (29) and (25), $U_{2n} = bp^e(2a - 4bp^e) \equiv (2a)U_n$ (modulo $p^{e+1}$), while we have
$U_{2n+1} = U_{n+1}^2 - U_n^2 \equiv a^2$. Similarly, $U_{3n} = U_{2n+1}U_n - U_{2n}U_{n-1} \equiv (3a^2)U_n$
and $U_{3n+1} = U_{2n+1}U_{n+1} - U_{2n}U_n \equiv a^3$. In general,

    $U_{kn} \equiv (ka^{k-1})U_n$ and $U_{kn+1} \equiv a^k$ (modulo $p^{e+1}$),

so (30) follows if we take $k = p$.

From formulas (27) and (28) we can obtain other expressions for $U_n$ and $V_n$,
expanding $(2 \pm \sqrt3)^n$ by the binomial theorem:

    $$U_n = \sum_k \binom{n}{2k+1} 2^{n-2k-1} 3^k, \qquad V_n = \sum_k \binom{n}{2k} 2^{n-2k+1} 3^k. \tag{31}$$

Now if we set $n = p$, where $p$ is an odd prime, and if we use the fact that $\binom{p}{k}$ is
a multiple of $p$ except when $k = 0$ or $k = p$, we find that

    $$U_p \equiv 3^{(p-1)/2}, \qquad V_p \equiv 4 \pmod p. \tag{32}$$

If $p \ne 3$, Fermat's theorem tells us that $3^{p-1} \equiv 1$; hence $(3^{(p-1)/2} - 1) \times
(3^{(p-1)/2} + 1) \equiv 0$, and $3^{(p-1)/2} \equiv \pm 1$. When $U_p \equiv -1$, we have $U_{p+1} =
4U_p - U_{p-1} = 4U_p + V_p - U_{p+1} \equiv -U_{p+1}$; hence $U_{p+1} \bmod p = 0$. When
$U_p \equiv +1$, we have $U_{p-1} = 4U_p - U_{p+1} = 4U_p - V_p - U_{p-1} \equiv -U_{p-1}$; hence
$U_{p-1} \bmod p = 0$. We have proved that, for all primes $p$, there is an integer $\epsilon(p)$
such that

    $$U_{p+\epsilon(p)} \bmod p = 0, \qquad |\epsilon(p)| \le 1. \tag{33}$$
Now if $N$ is any positive integer, and if $m = m(N)$ is the smallest positive
integer such that $U_{m(N)} \bmod N = 0$, we have

    $U_n \bmod N = 0$ if and only if $n$ is a multiple of $m(N)$.        (34)

(This number $m(N)$ is called the "rank of apparition" of $N$ in the sequence.)
To prove (34), observe that the sequence $U_m$, $U_{m+1}$, $U_{m+2}$, ... is congruent
(modulo $N$) to $aU_0$, $aU_1$, $aU_2$, ..., where $a = U_{m+1} \bmod N$ is relatively prime
to $N$ because $\gcd(U_n, U_{n+1}) = 1$.
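
The rank of apparition is simple to compute by running recurrence (25) modulo $N$, which gives a way to spot-check (33) and (34) on small cases. The function name `rank_of_apparition` is an assumption of this sketch, which requires $N \ge 2$.

```python
def rank_of_apparition(N):
    """Smallest m > 0 with U_m mod N == 0, for the sequence (25); needs N >= 2."""
    a, b = 0, 1                      # U_0, U_1 (mod N)
    m = 0
    while True:
        a, b = b, (4 * b - a) % N    # step to U_{m+1}, U_{m+2}
        m += 1
        if a == 0:
            return m

small_primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]
# (33) together with (34): m(p) divides p - 1, p, or p + 1.
print([(p, rank_of_apparition(p)) for p in small_primes])
# For the Mersenne primes 7, 31, 127 the rank is 2**(q-1).
print([rank_of_apparition(2 ** q - 1) for q in (3, 5, 7)])
```
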

With these preliminaries out of the way, we are ready to prove Theorem L.
By (24) and induction,

    $$L_n = V_{2^n} \bmod (2^q - 1). \tag{35}$$

Furthermore, the identity $2U_{n+1} = 4U_n + V_n$ implies that $\gcd(U_n, V_n) \le 2$,
since any common factor of $U_n$ and $V_n$ must divide $U_n$ and $2U_{n+1}$, while
$\gcd(U_n, U_{n+1}) = 1$. So $U_n$ and $V_n$ have no odd factor in common, and if
$L_{q-2} = 0$ we must have

    $U_{2^{q-1}} = U_{2^{q-2}} V_{2^{q-2}} \equiv 0$ (modulo $2^q - 1$),

    $U_{2^{q-2}} \not\equiv 0$ (modulo $2^q - 1$).

Now if $m = m(2^q - 1)$ is the rank of apparition of $2^q - 1$, it must be a
divisor of $2^{q-1}$ but not of $2^{q-2}$; thus $m = 2^{q-1}$. We will prove that $n = 2^q - 1$
must therefore be prime: Let the factorization of n be $p_1^{e_1} \ldots p_r^{e_r}$. All primes $p_j$
are greater than 3, since n is odd and congruent to $(-1)^q - 1 = -2$ (modulo 3).
From (30), (33), and (34) we know that $U_t \equiv 0$ (modulo $2^q - 1$), where

    $t = \mathrm{lcm}\bigl(p_1^{e_1-1}(p_1 + \epsilon_1), \ldots, p_r^{e_r-1}(p_r + \epsilon_r)\bigr),$

and each $\epsilon_j$ is $\pm 1$. It follows that t is a multiple of $m = 2^{q-1}$. Let
$n_0 = \prod_{1 \le j \le r} p_j^{e_j-1}(p_j + \epsilon_j)$; we have
$n_0 \le \prod_{1 \le j \le r} p_j^{e_j-1}(p_j + \frac15 p_j) = (\frac65)^r n$. Also,
because $p_j + \epsilon_j$ is even, $t \le n_0/2^{r-1}$, since a factor of two is lost each time the
least common multiple of two even numbers is taken. Combining these results,
we have $m \le t \le 2(\frac35)^r n < 4(\frac35)^r m < 3m$; hence $r \le 2$ and t = m or t = 2m,
a power of 2. Therefore $e_1 = 1$, $e_r = 1$, and if n is not prime we must have
$n = 2^q - 1 = (2^k + 1)(2^l - 1)$ where $2^k + 1$ and $2^l - 1$ are prime. The latter
factorization is obviously impossible when q is odd, so n is prime.

Conversely, suppose that $n = 2^q - 1$ is prime; we must show that $V_{2^{q-2}} \equiv 0$
(modulo n). For this purpose it suffices to prove that $V_{2^{q-1}} \equiv -2$ (modulo n),
since $V_{2^{q-1}} = (V_{2^{q-2}})^2 - 2$. Now

    $V_{2^{q-1}} = \bigl((\sqrt2 + \sqrt6)/2\bigr)^{n+1} + \bigl((\sqrt2 - \sqrt6)/2\bigr)^{n+1}$

    $\phantom{V_{2^{q-1}}} = 2^{-n} \sum_k \binom{n+1}{2k} \sqrt2^{\,n+1-2k} \sqrt6^{\,2k} = 2^{(1-n)/2} \sum_k \binom{n+1}{2k} 3^k.$

Since n is prime, the binomial coefficient

    $\binom{n+1}{2k} = \binom{n}{2k} + \binom{n}{2k-1}$

is divisible by n except when k = 0 and k = (n + 1)/2; hence

    $2^{(n-1)/2} V_{2^{q-1}} \equiv 1 + 3^{(n+1)/2}$ (modulo n).

Here $2 \equiv (2^{(q+1)/2})^2$, so $2^{(n-1)/2} \equiv (2^{(q+1)/2})^{n-1} \equiv 1$ by Fermat's theorem.
Finally, by a simple case of the law of quadratic reciprocity (cf. exercise 23),
$3^{(n-1)/2} \equiv -1$, since n mod 3 = 1 and n mod 4 = 3. This means $V_{2^{q-1}} \equiv -2$,
so $V_{2^{q-2}} \equiv 0$. |
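
The test that Theorem L justifies can be sketched in a few lines. This is a minimal illustration, assuming the sequence $L_0 = 4$, $L_{n+1} = (L_n^2 - 2) \bmod (2^q - 1)$ defined earlier in the section; the function name `lucas_lehmer` is a hypothetical label, and the code relies on Python's built-in arbitrary-precision integers rather than on any special multiplication technique.

```python
def lucas_lehmer(q: int) -> bool:
    """Return True if 2**q - 1 is prime, for an odd prime q.

    Iterates L_0 = 4, L_{n+1} = (L_n**2 - 2) mod (2**q - 1) and
    reports whether L_{q-2} == 0, as in Theorem L.
    """
    n = (1 << q) - 1
    L = 4
    for _ in range(q - 2):
        L = (L * L - 2) % n
    return L == 0


if __name__ == "__main__":
    # Odd primes q for which 2**q - 1 passes the test.
    print([q for q in (3, 5, 7, 11, 13, 17, 19, 23, 29, 31) if lucas_lehmer(q)])
```

The argument q is assumed to be an odd prime; the sketch makes no attempt to check this.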

The world’s largest explicitly known prime numbers have always been Mer- 
senne primes, at least from 1772 until 1980 when this book was written. But the 
situation will probably change soon, since Mersenne primes are getting harder 
to find, and since exercise 27 presents an efficient test for primes of other forms. 


EXERCISES 

1. [10] If the sequence $d_0$, $d_1$, $d_2$, ... of trial divisors in Algorithm A contains a
number that is not prime, why will it never appear in the output?

2. [15] If it is known that the input N to Algorithm A is equal to 3 or more, could 
step A2 be eliminated? 

3. [M20] Show that there is a number P with the following property: If 1000 < n <
1000000, then n is prime if and only if gcd(n, P) = 1.

4. [M29] In the notation of exercise 3.1-7 and Section 1.2.11.3, prove that the
average value of the least n such that $X_n = X_{\ell(n)-1}$ lies between 1.5 Q(m) − 0.5
and 1.625 Q(m) − 0.5.

5. [21] Use Fermat’s method (Algorithm D) to find the factors of 10541 by hand, 
when the moduli are 3, 5, 7, and 8. 

6. [M24] If p is an odd prime and if N is not a multiple of p, prove that the number
of integers x such that 0 < x < p and $x^2 - N \equiv y^2$ (modulo p) has a solution y is
equal to $(p \pm 1)/2$.

7. [25] Discuss the problems of programming the sieve of Algorithm D on a binary
computer when the table entries for modulus $m_i$ do not exactly fill an integral number
of memory words.

► 8. [23] (The “sieve of Eratosthenes,” 3rd century B.C.) The following procedure
evidently discovers all odd prime numbers less than a given integer N, since it removes
all the nonprime numbers: Start with all the odd numbers less than N; then successively
strike out the multiples $p_k^2$, $p_k(p_k + 2)$, $p_k(p_k + 4)$, ..., of the kth prime $p_k$, for k = 2,
3, 4, ..., until reaching a prime $p_k$ with $p_k^2 > N$.

Show how to adapt the procedure just described into an algorithm that is directly
suited to efficient computer calculation, using no multiplication.

9. [M25] Let n be an odd number, n > 3. Show that if the number $\lambda(n)$ of Theorem
3.2.1.2B is a divisor of n − 1 but not equal to n − 1, then n must have the form
$p_1 p_2 \ldots p_t$ where the p's are distinct primes and $t \ge 3$.

► 10. [M26] (John Selfridge.) Prove that if, for each prime divisor p of n − 1, there is
a number $x_p$ such that $x_p^{(n-1)/p} \bmod n \ne 1$ but $x_p^{n-1} \bmod n = 1$, then n is prime.

11. [M20] What outputs does Algorithm E give when N = 197209, k = 5, m = 1?
[Hint: $\sqrt{5 \cdot 197209} = 992 + /1, 495, 2, 495, 1, 1984/$.]

► 12. [M28] Design an algorithm that uses the outputs of Algorithm E to find a proper
factor of N, provided that Algorithm E has produced enough outputs to deduce a
solution of (17).

13. [HM25] (J. D. Dixon.) Prove that whenever the algorithm of exercise 12 is
presented with a solution $(x, e_0, \ldots, e_m)$ whose exponents are linearly dependent modulo 2
on the exponents of previous solutions, the probability is $2^{1-d}$ that a factorization will
be found, when n has d distinct prime factors and x is chosen at random.

14. [M20] Prove that the number T in step E3 of Algorithm E will never be a multiple
of an odd prime p for which $(kN)^{(p-1)/2} \bmod p > 1$.

► 15. [M34] (Lucas and Lehmer.) Let P and Q be relatively prime integers, and let
$U_0 = 0$, $U_1 = 1$, $U_{n+1} = PU_n - QU_{n-1}$ for n > 1. Prove that if N is a positive integer
relatively prime to $2P^2 - 8Q$, and if $U_{N+1} \bmod N = 0$, while $U_{(N+1)/p} \bmod N \ne 0$
for each prime p dividing N + 1, then N is prime. (This gives a test for primality
when the factors of N + 1 are known instead of the factors of N − 1. We can evaluate
$U_m$ in O(log m) steps; cf. exercise 4.6.3-26.) [Hint: See the proof of Theorem L.]

16. [M50] Are there infinitely many Mersenne primes?

17. [M25] (V. R. Pratt.) A complete proof of primality by the converse of Fermat's
theorem takes the form of a tree whose nodes have the form (q, x), where q and x
are positive integers satisfying the following arithmetic conditions: (i) If $(q_1, x_1)$, ...,
$(q_t, x_t)$ are the sons of (q, x) then $q = q_1 \ldots q_t + 1$. [In particular, if (q, x) has no sons,
then q = 2.] (ii) If (r, y) is a son of (q, x), then $x^{(q-1)/r} \bmod q \ne 1$. (iii) For each node
(q, x), we have $x^{q-1} \bmod q = 1$. From these conditions it follows that q is prime and
x is a primitive root modulo q, for all nodes (q, x). [For example, the tree



demonstrates that 1009 is prime.] Prove that such a tree with root (q, x) has at most
f(q) nodes, where f is a rather slowly growing function.

► 18. [HM23] Give a heuristic proof of (7), analogous to the text's derivation of (6).
What is the approximate probability that $p_{t-1} < \sqrt{p_t}$?

► 19. [M25] (J. M. Pollard.) Show how to compute a number M that is divisible by
all primes p such that p − 1 is a divisor of some given number D. [Hint: Consider
numbers of the form $a^n - 1$.] Such an M is useful in factorization, for by computing
gcd(M, N) we may discover a factor of N. Extend this idea to an efficient method that
has high probability of discovering prime factors p of a given large number N, when
all prime power factors of p − 1 are less than $10^3$ except for at most one prime factor
less than $10^5$. [For example, the second-largest prime dividing (14) would be detected
by this method, since it is $1 + 2^4 \cdot 5^2 \cdot 67 \cdot 107 \cdot 199 \cdot 41231$.]

20. [M40] Consider exercise 19 with p + 1 replacing p − 1.

21. [M49] (R. K. Guy.) Let m(p) be the number of iterations required by Algorithm B
to cast out the prime factor p. Is $m(p) = O(\sqrt{p \log p})$ as $p \to \infty$?

► 22. [M30] (M. O. Rabin.) Let $p_n$ be the probability that Algorithm P guesses wrong,
given n. Show that $p_n < \frac14$ for all n.

23. [M35] The Jacobi symbol $(\frac{p}{q})$ is defined to be −1, 0, or +1 for all integers p > 0
and all odd integers q > 1 by the rules $(\frac{p}{q}) \equiv p^{(q-1)/2}$ (modulo q) when q is prime;
$(\frac{p}{q}) = (\frac{p}{q_1}) \ldots (\frac{p}{q_t})$ when q is the product $q_1 \ldots q_t$ of t primes (not necessarily distinct).

a) Prove that $(\frac{p}{q})$ satisfies the following relationships, hence it can be computed
efficiently: $(\frac{0}{q}) = 0$; $(\frac{1}{q}) = 1$; $(\frac{p}{q}) = (\frac{p \bmod q}{q})$; $(\frac{2}{q}) = (-1)^{(q^2-1)/8}$;
$(\frac{pp'}{q}) = (\frac{p}{q})(\frac{p'}{q})$; $(\frac{p}{q}) = (-1)^{(p-1)(q-1)/4}(\frac{q}{p})$ if both p and q are odd. [The latter law,
which is a reciprocity relation reducing the evaluation of $(\frac{p}{q})$ to the evaluation of
$(\frac{q}{p})$, has been proved in exercise 1.2.4-47(d) when both p and q are prime, so you
may assume its validity in that special case.]

b) (Solovay and Strassen.) Prove that if n is odd but not prime, the number of
integers x such that 1 < x < n and $0 \ne (\frac{x}{n}) \equiv x^{(n-1)/2}$ (modulo n) is at most
$\frac12\varphi(n)$. (Thus, the following testing procedure correctly determines whether or not
a given n is prime, with probability $> \frac12$ for all fixed n: “Generate x at random
with 1 < x < n. If $0 \ne (\frac{x}{n}) \equiv x^{(n-1)/2}$ (modulo n), say that n is probably prime,
otherwise say that n is definitely not prime.”)

c) (L. Monier.) Prove that if n and x are numbers for which Algorithm P concludes
that “n is probably prime”, then $0 \ne (\frac{x}{n}) \equiv x^{(n-1)/2}$ (modulo n). [Hence
Algorithm P is always superior to the test in (b).]

► 24. [M25] (L. Adleman.) When n > 1 and x > 1 are integers, n odd, let us say
that n “passes the x test of Algorithm P” if either x = n or if steps P2-P5 lead to
the conclusion that n is probably prime. Prove that, for any N, there exists a set of
positive integers $x_1$, ..., $x_m < N$ with $m < \lfloor \lg N \rfloor$ such that a positive odd integer
in the range 1 < n < N is prime if and only if it passes the x test of Algorithm P
for $x = x_1 \bmod n$, ..., $x = x_m \bmod n$. Thus, the probabilistic test for primality can in
principle be converted into an efficient test that leaves nothing to chance. (You need
not show how to compute the $x_j$ efficiently; just prove that they exist.)

25. [HM41] (B. Riemann.) Prove that

f e «+”-)l„* dt /(t + !T) + 0( 1), 

- OO 

where the sum is over all complex $\sigma + i\tau$ such that $\tau \ne 0$ and $\zeta(\sigma + i\tau) = 0$.


7r(x) = J dt /In t -f- 




► 26. [M25] (H. C. Pocklington, 1914.) Let N = fr + 1 > 1, where 0 < r < f + 1.
Prove that N is prime if, for every prime divisor p of f, there is an integer x such that
$x^{N-1} \bmod N = \gcd(x^{(N-1)/p} - 1, N) = 1$.

► 27. [M30] Show that there is a way to test numbers of the form $5 \cdot 2^n + 1$ for primality,
using the same amount of computer time as the Lucas-Lehmer test for Mersenne primes
in Theorem L, except for an additional O(n log n) seconds.

28. [M27] Given a prime p and a positive integer d, what is the value of f(p, d), the
average number of times p divides $A^2 - dB^2$, when A and B are random integers that
are independent except for the condition gcd(A, B) = 1?

29. [M25] Prove that the number of positive integers < n whose prime factors
are all contained in a given set of primes $\{p_1, \ldots, p_m\}$ is at least $m^r/r!$, when
$r = \lfloor \log n / \log p_m \rfloor$ and $p_1 < \cdots < p_m$.

30. [HM35] (J. D. Dixon and Claus-Peter Schnorr.) Let $p_1 < \cdots < p_m$ be primes
that do not divide the odd number N, and let r be an even integer $< \log N / \log p_m$.
Prove that the number of integers X in the range 0 < X < N such that $X^2 \bmod N =
p_1^{e_1} \ldots p_m^{e_m}$ is at least $m^r/r!$. [Hint: Let the prime factorization of N be $q_1^{f_1} \ldots q_d^{f_d}$.
Show that a sequence of exponents $(e_1, \ldots, e_m)$ leads to $2^d$ solutions X whenever we
have $e_1 + \cdots + e_m \le r$ and $p_1^{e_1} \ldots p_m^{e_m}$ is a quadratic residue modulo $q_i$ for $1 \le i \le d$.
Such exponent sequences can be obtained as ordered pairs $(e'_1, \ldots, e'_m; e''_1, \ldots, e''_m)$ where
$e'_1 + \cdots + e'_m \le \frac12 r$ and $e''_1 + \cdots + e''_m \le \frac12 r$ and

    $(p_1^{e'_1} \ldots p_m^{e'_m})^{(q_i-1)/2} \equiv (p_1^{e''_1} \ldots p_m^{e''_m})^{(q_i-1)/2}$ (modulo $q_i$)

for $1 \le i \le d$.]

31. [M20] Use formula 3.5-33 to show that the probability is less than $e^{-m/2}$ that
Dixon's factorization algorithm (as described preceding Theorem D) obtains fewer than
2m outputs.

► 32. [M21] Show how to modify the RSA encoding scheme so that there is no problem
with messages $< \sqrt[3]{N}$, in such a way that the length of messages is not substantially
increased.

33. [M50] Prove or disprove: If a reasonably efficient algorithm exists that has a
nonnegligible probability of being able to find x mod N, given a number N = pq whose
prime factors satisfy $p \equiv q \equiv 2$ (modulo 3) and given the value of $x^3 \bmod N$, then there
is a reasonably efficient algorithm that has a nonnegligible probability of being able to
find the factors of N. [If this could be proved, it would not only show that the cube
root problem is as difficult as factoring, it would also show that the RSA scheme has
the same fatal flaw as the SQRT scheme.]

34. [M30] (Peter Weinberger.) Suppose N = pq in the RSA scheme, and suppose
you know a number m such that $x^m \bmod N = 1$ for at least $10^{-12}$ of all positive
integers x. Explain how you would go about factoring N without great difficulty, if m
is not too large.

► 35. [M25] (H. C. Williams, 1979.) Let N be the product of two primes p and q, where
p mod 8 = 3 and q mod 8 = 7. Prove that the Jacobi symbol satisfies $(\frac{-x}{N}) = (\frac{x}{N}) =
-(\frac{2x}{N})$, and use this to design an encoding/decoding scheme analogous to Rabin's SQRT
box but with no ambiguity of messages.
36. [HM24] The asymptotic analysis following (21) is too coarse to give meaningful
values unless N is extremely large, since ln ln N is always rather small when N is in a
practical range. Carry out a more precise analysis that gives insight into the behavior
of (21) for reasonable values of N; also explain how to choose a value of ln m that
minimizes (21) except for a factor of size at most exp(O(log log N)).

37. [M27] Prove that the square root of every positive integer D has a periodic
continued fraction of the form

    $\sqrt{D} = R + /a_1, \ldots, a_n, 2R, a_1, \ldots, a_n, 2R, a_1, \ldots, a_n, 2R, \ldots/,$

unless D is a perfect square, where $R = \lfloor \sqrt{D} \rfloor$ and $(a_1, \ldots, a_n)$ is a palindrome (i.e.,
$a_i = a_{n+1-i}$ for $1 \le i \le n$).

► 38. [M36] (A. Shamir.) Consider an abstract computer that can perform the
operations x + y, x − y, x · y, and $\lfloor x/y \rfloor$ on integers x and y of arbitrary length, in just one
unit of time, no matter how large those integers are. The machine stores integers in a
random-access memory and it can select different program steps depending on whether
or not x = y, given x and y. The purpose of this exercise is to demonstrate that there
is an amazingly fast way to factorize numbers on such a computer. (Therefore it will
probably be quite difficult to show that factorization is inherently complicated on real
machines, although we suspect that it is.)

a) Find a way to compute n! in O(log n) steps on such a computer, given an integer
value n > 2. [Hint: If A is a sufficiently large integer, the binomial coefficients
$\binom{m}{k} = m!/(m-k)!\,k!$ can be computed readily from the value of $(A + 1)^m$.]

b) Show how to compute a number f(n) in O(log n) steps on such a computer, given
an integer value n > 2, having the following properties: f(n) = n if n is prime,
otherwise f(n) is a proper (but not necessarily prime) divisor of n. [Hint: If $n \ne 4$,
one such function f(n) is gcd(m(n), n), where m(n) = min{ m | m! mod n = 0 }.]

(As a consequence of (b), we can completely factor a given number n by doing only
$O(\log n)^2$ arithmetic operations on arbitrarily large integers: Given a partial
factorization $n = n_1 \ldots n_r$, each nonprime $n_i$ can be replaced by $f(n_i) \cdot (n_i/f(n_i))$ in a total of
$\sum O(\log n_i) = O(\log n)$ steps, and this refinement operation can be repeated until all
$n_i$ are prime.)


The problem of distinguishing prime numbers from composites, 
and of resolving composite numbers into their prime factors, 
is one of the most important and useful in all of arithmetic. 
... The dignity of science seems to demand that every aid to the solution 
of such an elegant and celebrated problem be zealously cultivated. 

—K. F. GAUSS, Disquisitiones Arithmeticæ, Art. 329 (1801)
4.6. POLYNOMIAL ARITHMETIC

The TECHNIQUES we have been studying apply in a natural way to many 
different types of mathematical quantities, not simply to numbers. In this section 
we shall deal with polynomials, which are the next step up from numbers. 
Formally speaking, a polynomial over S is an expression of the form

    $u(x) = u_n x^n + \cdots + u_1 x + u_0,$    (1)

where the “coefficients” $u_n$, ..., $u_1$, $u_0$ are elements of some algebraic system S,
and the “variable” x may be regarded as a formal symbol with an indeterminate
meaning. We will assume that the algebraic system S is a commutative ring with
identity; this means that S admits the operations of addition, subtraction, and
multiplication, satisfying the customary properties: Addition and multiplication
are associative and commutative binary operations defined on S, where
multiplication distributes over addition; and subtraction is the inverse of addition.
There is an additive identity element 0 such that a + 0 = a, and a multiplicative
identity element 1 such that a · 1 = a, for all a in S. The polynomial
$0x^{n+m} + \cdots + 0x^{n+1} + u_n x^n + \cdots + u_1 x + u_0$ is regarded as the same
polynomial as (1), although its expression is formally different.

We say that (1) is a polynomial of degree n and leading coefficient $u_n$ if
$u_n \ne 0$; and in this case we write

    $\deg(u) = n, \qquad \ell(u) = u_n.$    (2)

By convention, we also set

    $\deg(0) = -\infty, \qquad \ell(0) = 0,$    (3)

where “0” denotes the zero polynomial whose coefficients are all zero. We say
that u(x) is a monic polynomial if its leading coefficient $\ell(u)$ is 1.

Arithmetic on polynomials consists primarily of addition, subtraction, and 
multiplication; in some cases, further operations such as division, exponentiation, 
factoring, and taking the greatest common divisor are important. The processes 
of addition, subtraction, and multiplication are defined in a natural way, as 
though the variable x were an element of S: Addition and subtraction are done 
by adding or subtracting the coefficients of like powers of x. Multiplication is
done by the rule

    $(u_r x^r + \cdots + u_0)(v_s x^s + \cdots + v_0) = (w_{r+s} x^{r+s} + \cdots + w_0),$

where

    $w_k = u_0 v_k + u_1 v_{k-1} + \cdots + u_{k-1} v_1 + u_k v_0.$    (4)

In the latter formula $u_i$ or $v_j$ are treated as zero if i > r or j > s.

The algebraic system S is usually the set of integers, or the rational numbers;
or it may itself be a set of polynomials (in variables other than x); in the latter
situation (1) is a multivariate polynomial, a polynomial in several variables.
Another important case occurs when the algebraic system S consists of the
integers 0, 1, ..., m − 1, with addition, subtraction, and multiplication performed
mod m (cf. Eq. 4.3.2-11); this is called polynomial arithmetic modulo m. The
special case of polynomial arithmetic modulo 2, when each of the coefficients is
0 or 1, is especially important.
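
Rule (4) translates directly into a double loop over coefficients. The following is a minimal sketch, assuming a polynomial is stored as a Python list of integer coefficients with the constant term first; the function name `poly_mul` and its optional parameter `m` (for arithmetic modulo m) are hypothetical choices made for this illustration.

```python
from typing import List, Optional


def poly_mul(u: List[int], v: List[int], m: Optional[int] = None) -> List[int]:
    """Multiply polynomials by rule (4): w_k is the sum of u_i * v_j with i + j = k.

    u[i] is the coefficient of x**i.  If m is given, the coefficients of
    the product are reduced modulo m (polynomial arithmetic modulo m).
    """
    w = [0] * (len(u) + len(v) - 1)
    for i, ui in enumerate(u):
        for j, vj in enumerate(v):
            w[i + j] += ui * vj
    if m is not None:
        w = [c % m for c in w]
    return w


if __name__ == "__main__":
    u = [1, 0, 1, 1]          # x^3 + x^2 + 1
    v = [1, 1, 0, 1]          # x^3 + x + 1
    print(poly_mul(u, v))     # product over the integers
    print(poly_mul(u, v, 2))  # product modulo 2
```

A dense list is the simplest representation; it is not the sparse representation that the text recommends for polynomials with many zero coefficients.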

The reader should note the similarity between polynomial arithmetic and
multiple-precision arithmetic (Section 4.3.1), where the radix b is substituted
for x. The chief difference is that the coefficient $u_k$ of $x^k$ in polynomial arithmetic
bears little or no essential relation to its neighboring coefficients $u_{k\pm1}$, so the
idea of “carrying” from one place to the next is absent. In fact, polynomial
arithmetic modulo b is essentially identical to multiple-precision arithmetic with
radix b, except that all carries are suppressed. For example, compare the
multiplication of $(1101)_2$ by $(1011)_2$ in the binary number system with the
analogous multiplication of $x^3 + x^2 + 1$ by $x^3 + x + 1$ modulo 2:


        Binary system            Polynomials modulo 2

               1101                      1101
             × 1011                    × 1011
             ------                    ------
               1101                      1101
              1101                      1101
            1101                      1101
           --------                   -------
           10001111                   1111111


The product of these polynomials modulo 2 is obtained by suppressing all carries,
and it is $x^6 + x^5 + x^4 + x^3 + x^2 + x + 1$. If we had multiplied the same polynomials
over the integers, without taking residues modulo 2, the result would have been
$x^6 + x^5 + x^4 + 3x^3 + x^2 + x + 1$; again carries are suppressed, but in this case
the coefficients can get arbitrarily large.
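
The comparison above can be replayed on packed bit strings. This is a minimal sketch, assuming each polynomial modulo 2 is packed into a Python integer whose bit k holds the coefficient of $x^k$; the function name `carryless_mul` is a hypothetical label. Replacing addition by exclusive-or is what suppresses the carries.

```python
def carryless_mul(a: int, b: int) -> int:
    """Multiply two polynomials modulo 2 packed into integers.

    Bit k of each argument is the coefficient of x**k.  Partial products
    are combined with XOR instead of addition, so no carries propagate.
    """
    result = 0
    while b:
        if b & 1:
            result ^= a       # add the shifted copy of a, modulo 2
        a <<= 1               # multiply a by x
        b >>= 1
    return result


if __name__ == "__main__":
    a, b = 0b1101, 0b1011
    print(format(a * b, "b"))                 # ordinary binary product
    print(format(carryless_mul(a, b), "b"))   # product of polynomials modulo 2
```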

In view of this strong analogy with multiple-precision arithmetic, it is
unnecessary to discuss polynomial addition, subtraction, and multiplication much
further in this section. However, we should point out some factors that often
make polynomial arithmetic somewhat different from multiple-precision
arithmetic in practice: There is often a tendency to have a large number of zero
coefficients, and polynomials of huge degrees, so special forms of representation are
desirable; see Section 2.2.4. Furthermore, arithmetic on polynomials in several
variables leads to routines that are best understood in a recursive framework;
this situation is discussed in Chapter 8.

Although the techniques of polynomial addition, subtraction, and
multiplication are comparatively straightforward, there are several other important
aspects of polynomial arithmetic that deserve special examination. The following
subsections therefore discuss division of polynomials, with associated techniques
such as finding greatest common divisors and factoring. We shall also discuss
the problem of efficient evaluation of polynomials, i.e., the task of finding the
value of u(x) when x is a given element of S, using as few operations as possible.
The special case of evaluating $x^n$ rapidly when n is large is quite important by
itself, so it is discussed in detail in Section 4.6.3.

The first major set of computer subroutines for doing polynomial arithmetic 
was the ALPAK system [W. S. Brown, J. P. Hyde, and B. A. Tague, Bell System 
Tech. J. 42 (1963), 2081-2119; 43 (1964), 785-804, 1547-1562]. Another early 
landmark in this field was the PM system of George Collins [CACM 9 (1966), 
578-589]; see also C. L. Hamblin, Comp. J. 10 (1967), 168-171. 


EXERCISES 

1. [10] If we are doing polynomial arithmetic modulo 10, what is 7x + 2 minus
$x^2 + 3$? What is $6x^2 + x + 3$ times $5x^2 + 2$?

2. [17] True or false: (a) The product of monic polynomials is monic. (b) The
product of polynomials of respective degrees m and n has degree m + n. (c) The sum
of polynomials of respective degrees m and n has degree max(m, n).

3. [M20] If each of the coefficients $u_r$, ..., $u_0$, $v_s$, ..., $v_0$ in (4) is an integer satisfying
the conditions $|u_i| < m_1$, $|v_j| < m_2$, what is the maximum absolute value of the
product coefficients $w_k$?

► 4. [21] Can the multiplication of polynomials modulo 2 be facilitated by using the
ordinary arithmetic operations on a binary computer, if coefficients are packed into
computer words?

► 5. [M21] Show how to multiply two polynomials of degree < n, modulo 2, with
an execution time proportional to $O(n^{\lg 3})$ when n is large, by adapting Karatsuba's
method (cf. Section 4.3.3).


4.6.1. Division of Polynomials 

It is possible to divide one polynomial by another in essentially the same way
that we divide one multiple-precision integer by another, when arithmetic is
being done on polynomials over a “field.” A field S is a commutative ring with
identity, in which exact division is possible as well as the operations of addition,
subtraction, and multiplication; this means as usual that whenever u and v are
elements of S, and $v \ne 0$, there is an element w in S such that u = vw. The
most important fields of coefficients that arise in applications are

a) the rational numbers (represented as fractions, see Section 4.5.1); 

b) the real or complex numbers (represented within a computer by means of 
floating point approximations; see Section 4.2); 

c) the integers modulo p where p is prime (where division can be implemented 
as suggested in exercise 4.5.2-15); 

d) “rational functions” over a field (namely, quotients of two polynomials whose 
coefficients are in that field, the denominator being monic). 
Of special importance is the field of integers modulo 2, when the two values 0
and 1 are the only elements of the field. Polynomials over this field (namely
polynomials modulo 2) have many analogies to integers expressed in binary
notation; and rational functions over this field have striking analogies to rational
numbers whose numerator and denominator are represented in binary notation.

Given two polynomials u(x) and v(x) over a field, with $v(x) \ne 0$, we can
divide u(x) by v(x) to obtain a quotient polynomial q(x) and a remainder
polynomial r(x) satisfying the conditions

    $u(x) = q(x) \cdot v(x) + r(x), \qquad \deg(r) < \deg(v).$    (1)

It is easy to see that there is at most one pair of polynomials (q(x), r(x)) satisfying
these relations; for if (1) holds for both $(q_1(x), r_1(x))$ and $(q_2(x), r_2(x))$ and for
the same polynomials u(x), v(x), then $q_1(x)v(x) + r_1(x) = q_2(x)v(x) + r_2(x)$,
so $(q_1(x) - q_2(x))v(x) = r_2(x) - r_1(x)$. Now if $q_1(x) - q_2(x)$ is nonzero, we
have $\deg((q_1 - q_2) \cdot v) = \deg(q_1 - q_2) + \deg(v) \ge \deg(v) > \deg(r_2 - r_1)$, a
contradiction; hence $q_1(x) - q_2(x) = 0$ and $r_1(x) - r_2(x) = 0$.

The following algorithm, which is essentially the same as Algorithm 4.3.1D
for multiple-precision division but without any concerns of carries, may be used
to determine q(x) and r(x):

Algorithm D (Division of polynomials over a field). Given polynomials

    $u(x) = u_m x^m + \cdots + u_1 x + u_0, \qquad v(x) = v_n x^n + \cdots + v_1 x + v_0$

over a field S, where $v_n \ne 0$ and $m \ge n \ge 0$, this algorithm finds the
polynomials

    $q(x) = q_{m-n} x^{m-n} + \cdots + q_0, \qquad r(x) = r_{n-1} x^{n-1} + \cdots + r_0$

over S that satisfy (1).

D1. [Iterate on k.] Do step D2 for k = m − n, m − n − 1, ..., 0; then the
algorithm terminates with $(r_{n-1}, \ldots, r_0) \leftarrow (u_{n-1}, \ldots, u_0)$.

D2. [Division loop.] Set $q_k \leftarrow u_{n+k}/v_n$, and then set $u_j \leftarrow u_j - q_k v_{j-k}$ for
j = n + k − 1, n + k − 2, ..., k. (The latter operation amounts to replacing
u(x) by $u(x) - q_k x^k v(x)$, a polynomial of degree < n + k.) |

An example of Algorithm D appears below in (5). The number of arithmetic
operations is essentially proportional to n(m − n + 1). For some reason this
procedure has become known as “synthetic division” of polynomials. Note that
explicit division of coefficients is done only at the beginning of step D2, and the
divisor is always $v_n$; thus if v(x) is a monic polynomial (with $v_n = 1$), there is no
actual division at all. If multiplication is easier to perform than division it will
be preferable to compute $1/v_n$ at the beginning of the algorithm and to multiply
by this quantity in step D2.

We shall often write u(x) mod v(x) for the remainder r(x) in (1).
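
Steps D1 and D2 can be sketched as follows for the field of rational numbers. This is a minimal illustration, not a tuned implementation: it assumes coefficient lists with the constant term first, uses `fractions.Fraction` from the Python standard library for exact rational arithmetic, and the function name `poly_divmod` is a hypothetical label.

```python
from fractions import Fraction
from typing import List, Tuple


def poly_divmod(u: List[int], v: List[int]) -> Tuple[List[Fraction], List[Fraction]]:
    """Algorithm D over the rationals.

    u[i] and v[i] are the coefficients of x**i, with v[-1] != 0 and
    len(u) >= len(v) >= 1.  Returns (q, r) with u = q*v + r and
    deg(r) < deg(v); r has exactly n = deg(v) coefficients.
    """
    m, n = len(u) - 1, len(v) - 1
    assert n >= 0 and v[n] != 0 and m >= n
    work = [Fraction(c) for c in u]          # working copy of u's coefficients
    q = [Fraction(0)] * (m - n + 1)
    for k in range(m - n, -1, -1):           # D1: k = m-n, m-n-1, ..., 0
        q[k] = work[n + k] / v[n]            # D2: the only division
        for j in range(n + k - 1, k - 1, -1):
            work[j] -= q[k] * v[j - k]
    return q, work[:n]                       # (r_{n-1}, ..., r_0) <- (u_{n-1}, ..., u_0)


if __name__ == "__main__":
    # (x^3 - 1) divided by (x - 1)
    q, r = poly_divmod([-1, 0, 0, 1], [-1, 1])
    print(q, r)
```

As the text observes, the only division occurs once per value of k, and it disappears entirely when v(x) is monic.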
Unique factorization domains. If we restrict consideration to polynomials over a
field, we are not coming to grips with many important cases, such as polynomials
over the integers or polynomials in several variables. Let us therefore now
consider the more general situation that the algebraic system S of coefficients is
a unique factorization domain, not necessarily a field. This means that S is a
commutative ring with identity, and that

i) $uv \ne 0$, whenever u and v are nonzero elements of S;

ii) every nonzero element u of S is either a “unit” or has a “unique”
representation of the form

    $u = p_1 \ldots p_t, \qquad t \ge 1,$    (2)

where $p_1$, ..., $p_t$ are “primes.”

Here a “unit” u is an element that has a reciprocal, i.e., an element such that
uv = 1 for some v in S; and a “prime” p is a nonunit element such that the
equation p = qr can be true only if either q or r is a unit. The representation
(2) is to be unique in the sense that if $p_1 \ldots p_t = q_1 \ldots q_s$, where all the p's and
q's are primes, then s = t and there is a permutation $\pi_1 \ldots \pi_t$ of {1, ..., t} such
that $p_1 = a_1 q_{\pi_1}$, ..., $p_t = a_t q_{\pi_t}$ for some units $a_1$, ..., $a_t$. In other words,
factorization into primes is unique, except for unit multiples and except for the
order of the factors.

Any field is a unique factorization domain, in which each nonzero element is
a unit and there are no primes. The integers form a unique factorization domain
in which the units are +1 and −1, and the primes are ±2, ±3, ±5, ±7, ±11,
etc. The case that S is the set of all integers is of principal importance, because
it is often preferable to work with integer coefficients instead of arbitrary rational
coefficients.

One of the key facts about polynomials (see exercise 10) is that the
polynomials over a unique factorization domain form a unique factorization domain.
A polynomial that is “prime” in this domain is usually called an irreducible
polynomial. By using the unique factorization theorem repeatedly, we can prove
that multivariate polynomials over the integers, or over any field, in any number
of variables, can be uniquely factored into irreducible polynomials. For example,
the multivariate polynomial $90x^3 - 120x^2y + 18x^2yz - 24xy^2z$ over the integers
is the product of five irreducible polynomials $2 \cdot 3 \cdot x \cdot (3x - 4y) \cdot (5x + yz)$.
The same polynomial, as a polynomial over the rationals, is the product of three
irreducible polynomials $(6x) \cdot (3x - 4y) \cdot (5x + yz)$; this factorization can also be
written $x \cdot (90x - 120y) \cdot (x + \frac15 yz)$ and in infinitely many other ways, although
the factorization is essentially unique.

As usual, we say that u(x) is a multiple of v(x), and that v(x) is a divisor
of u(x), if u(x) = v(x)q(x) for some polynomial q(x). If we have an algorithm to
tell whether or not u is a multiple of v for arbitrary nonzero elements u and v
of a unique factorization domain S, and to determine w if u = v · w, then
Algorithm D gives us a method to tell whether or not u(x) is a multiple of v(x)
for arbitrary nonzero polynomials u(x) and v(x) over S. For if u(x) is a multiple
of v(x), it is easy to see that $u_{n+k}$ must be a multiple of $v_n$ each time we get to
step D2, hence the quotient u(x)/v(x) will be found. (Applying this observation
repeatedly, we obtain an algorithm that decides if a given polynomial over S, in
any number of variables, is a multiple of another given polynomial over S, and
the algorithm will find the quotient when it exists.)
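
For S equal to the integers, the divisibility test just described can be sketched by running Algorithm D with exact integer division and giving up as soon as a leading coefficient fails to be a multiple of $v_n$. The function name `exact_quotient` and the list representation (constant term first) are assumptions made for this illustration.

```python
from typing import List, Optional


def exact_quotient(u: List[int], v: List[int]) -> Optional[List[int]]:
    """Return q with u = v*q over the integers, or None if no such q exists.

    u and v are nonzero integer polynomials, u[i] being the coefficient
    of x**i, with nonzero leading coefficients u[-1] and v[-1].
    """
    m, n = len(u) - 1, len(v) - 1
    if m < n:
        return None                          # a nonzero u of lower degree is no multiple
    work = list(u)
    q = [0] * (m - n + 1)
    for k in range(m - n, -1, -1):
        if work[n + k] % v[n] != 0:          # u_{n+k} must be a multiple of v_n
            return None
        q[k] = work[n + k] // v[n]
        for j in range(n + k - 1, k - 1, -1):
            work[j] -= q[k] * v[j - k]
    # u is a multiple of v only if the remainder vanishes.
    return q if not any(work[:n]) else None


if __name__ == "__main__":
    print(exact_quotient([2, 7, 6], [1, 2]))   # (6x^2 + 7x + 2) / (2x + 1)
    print(exact_quotient([1, 0, 1], [1, 1]))   # x^2 + 1 is not a multiple of x + 1
```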

A set of elements of a unique factorization domain is said to be relatively
prime if no prime of that unique factorization domain divides all of them. A
polynomial over a unique factorization domain is called primitive if its
coefficients are relatively prime. (This concept should not be confused with the quite
different idea of “primitive polynomials modulo p” discussed in Section 3.2.2.)
The following fact, introduced for the case of polynomials over the integers
by K. F. Gauss in article 42 of his celebrated book Disquisitiones Arithmeticæ
(Leipzig: 1801), is of prime importance:

Lemma G (Gauss's Lemma). The product of primitive polynomials over a unique
factorization domain is primitive.

Proof. Let $u(x) = u_m x^m + \cdots + u_0$ and $v(x) = v_n x^n + \cdots + v_0$ be primitive
polynomials. If p is any prime of the domain, we must show that p does not
divide all the coefficients of u(x)v(x). By assumption, there is an index j such
that $u_j$ is not divisible by p, and an index k such that $v_k$ is not divisible by p.
Let j and k be as small as possible; then the coefficient of $x^{j+k}$ in u(x)v(x) is

    $u_j v_k + u_{j+1} v_{k-1} + \cdots + u_{j+k} v_0 + u_{j-1} v_{k+1} + \cdots + u_0 v_{k+j},$

and it is easy to see that this is not a multiple of p (since its first term isn't,
but all of its other terms are). |
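
Lemma G can be spot-checked over the integers, where a polynomial is primitive exactly when the greatest common divisor of its coefficients is 1. The sketch below is only a numerical illustration, not a proof; the helper names `coefficient_gcd`, `is_primitive`, and `poly_mul` are hypothetical, and `math.gcd` and `functools.reduce` come from the Python standard library.

```python
from functools import reduce
from math import gcd
from typing import List


def coefficient_gcd(u: List[int]) -> int:
    """Nonnegative gcd of all coefficients of u (0 for the zero polynomial)."""
    return reduce(gcd, u, 0)


def is_primitive(u: List[int]) -> bool:
    """Over the integers, u is primitive iff no prime divides every coefficient."""
    return coefficient_gcd(u) == 1


def poly_mul(u: List[int], v: List[int]) -> List[int]:
    """Product of two integer polynomials given as coefficient lists."""
    w = [0] * (len(u) + len(v) - 1)
    for i, ui in enumerate(u):
        for j, vj in enumerate(v):
            w[i + j] += ui * vj
    return w


if __name__ == "__main__":
    u = [2, 3]        # 3x + 2, primitive
    v = [5, 0, 7]     # 7x^2 + 5, primitive
    print(poly_mul(u, v), is_primitive(poly_mul(u, v)))
```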

If a nonzero polynomial u(x) over a unique factorization domain S is not
primitive, we can write $u(x) = p_1 \cdot u_1(x)$, where $p_1$ is a prime of S dividing all
the coefficients of u(x), and where $u_1(x)$ is another nonzero polynomial over S.
All of the coefficients of $u_1(x)$ have one less prime factor than the corresponding
coefficients of u(x). Now if $u_1(x)$ is not primitive, we can write $u_1(x) = p_2 \cdot u_2(x)$,
etc.; this process must ultimately terminate in a representation $u(x) = c \cdot u_k(x)$,
where c is an element of S and $u_k(x)$ is primitive. In fact, we have the following
companion to Lemma G:

Lemma H. Any nonzero polynomial u(x) over a unique factorization domain S
can be factored in the form $u(x) = c \cdot v(x)$, where c is in S and v(x) is primitive.
Furthermore, this representation is unique, in the sense that if
$u(x) = c_1 \cdot v_1(x) = c_2 \cdot v_2(x)$, then $c_1 = a c_2$ and $v_2(x) = a v_1(x)$ where a is a unit of S.

Proof. We have shown that such a representation exists, so only the uniqueness
needs to be proved. Assume that $c_1 \cdot v_1(x) = c_2 \cdot v_2(x)$, where $v_1(x)$ and $v_2(x)$
are primitive and $c_1$ is not a unit multiple of $c_2$. By unique factorization there
is a prime p of S and an exponent k such that $p^k$ divides one of $\{c_1, c_2\}$ but
not the other, say $p^k$ divides $c_1$ but not $c_2$. Then $p^k$ divides all of the
coefficients of $c_2 \cdot v_2(x)$, so p divides all the coefficients of $v_2(x)$, contradicting
the assumption that $v_2(x)$ is primitive. Hence $c_1 = a c_2$, where a is a unit; and
$0 = a c_2 \cdot v_1(x) - c_2 \cdot v_2(x) = c_2 \cdot (a v_1(x) - v_2(x))$, so $a v_1(x) - v_2(x) = 0$. |
Therefore we may write any nonzero polynomial u(x) as

    u(x) = cont(u) · pp(u(x)),                                        (3)

where cont(u), the “content” of u, is an element of S, and pp(u(x)), the “primitive
part” of u(x), is a primitive polynomial over S. When u(x) = 0, it is convenient
to define cont(u) = pp(u(x)) = 0. Combining Lemmas G and H gives us the
relations

    cont(u · v) = a cont(u) cont(v),
    pp(u(x) · v(x)) = b pp(u(x)) pp(v(x)),                            (4)

where a and b are units, depending on u and v, with ab = 1. When we are
working with polynomials over the integers, the only units are +1 and -1, and
it is conventional to define pp(u(x)) so that its leading coefficient is positive; then
(4) is true with a = b = 1. When working with polynomials over a field we may
take cont(u) = $\ell(u)$, so that pp(u(x)) is monic; in this case again (4) holds with
a = b = 1, for all u(x) and v(x).

For example, if we are dealing with polynomials over the integers, let
$u(x) = -26x^2 + 39$ and $v(x) = 21x + 14$. Then

    cont(u) = -13,        pp(u(x)) = $2x^2 - 3$,
    cont(v) = +7,         pp(v(x)) = $3x + 2$,
    cont(u · v) = -91,    pp(u(x) · v(x)) = $6x^3 + 4x^2 - 9x - 6$.
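
The example above can be checked mechanically. The following is a minimal sketch, not part of the original text: it assumes integer polynomials stored as Python lists with the highest-degree coefficient first, and the helper names `content`, `primitive_part`, and `poly_mul` are illustrative choices. The sign of the content is chosen so that the primitive part has a positive leading coefficient, following the convention just described.

```python
from functools import reduce
from math import gcd

def content(u):
    """Content of an integer polynomial (coefficients, highest degree first).

    The sign is chosen so that the primitive part has a positive leading
    coefficient; cont(0) is defined to be 0.
    """
    if not any(u):
        return 0
    c = reduce(gcd, (abs(a) for a in u))
    lead = next(a for a in u if a != 0)
    return c if lead > 0 else -c

def primitive_part(u):
    """pp(u(x)) = u(x) / cont(u); the division is always exact."""
    c = content(u)
    return [a // c for a in u] if c else [0]

def poly_mul(u, v):
    """Schoolbook product of two coefficient lists."""
    w = [0] * (len(u) + len(v) - 1)
    for i, a in enumerate(u):
        for j, b in enumerate(v):
            w[i + j] += a * b
    return w

u = [-26, 0, 39]       # -26x^2 + 39
v = [21, 14]           # 21x + 14
uv = poly_mul(u, v)
print(content(u), primitive_part(u))
print(content(v), primitive_part(v))
print(content(uv), primitive_part(uv))
```

With this sign convention the units a and b of (4) are both 1, so `content(uv)` equals `content(u) * content(v)`.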

Greatest common divisors. When there is unique factorization, it makes sense to 
speak of a “greatest common divisor” of two elements; this is a common divisor 
that is divisible by as many primes as possible. (Cf. Eq. 4.5.2-6.) Since a unique 
factorization domain may have many units, however, there is a certain amount 
of ambiguity in this definition of greatest common divisor; if w is a greatest
common divisor of u and v, so is a · w, when a is any unit. Conversely, the
assumption of unique factorization implies that if $w_1$ and $w_2$ are both greatest
common divisors of u and v, then $w_1 = a \cdot w_2$ for some unit a. In other words it
does not make sense, in general, to speak of “the” greatest common divisor of u
and v; there is a set of greatest common divisors, each one being a unit multiple
of the others.

Let us now consider the problem of finding a greatest common divisor of 
two given polynomials over an algebraic system S. If S is a field, the problem 
is relatively simple; our division algorithm, Algorithm D, can be extended to an 
algorithm that computes greatest common divisors, just as Euclid’s algorithm 
(Algorithm 4.5.2A) yields the greatest common divisor of two given integers based 
on a division algorithm for integers: If v(x) = 0, then gcd(u(x), v(x)) = u(x);
otherwise gcd(u(x), v(x)) = gcd(v(x), r(x)), where r(x) is given by (1). This
procedure is called Euclid’s algorithm for polynomials over a field; it was first 
used by Simon Stevin in 1585 [Les oeuvres mathematiques de Simon Stevin, ed. 
by A. Girard, 1 (Leyden, 1634), 56]. 
For example, let us determine the gcd of $x^8 + x^6 + 10x^4 + 10x^3 + 8x^2 + 2x + 8$
and $3x^6 + 5x^4 + 9x^2 + 4x + 8$, mod 13, by using Euclid’s algorithm for
polynomials over the integers modulo 13. First, writing only the coefficients to
show the steps of Algorithm D, we have

                                         9  0  7
    3 0 5 0 9 4 8 )    1  0  1  0 10 10  8  2  8
                       1  0  6  0  3 10  7
                          0  8  0  7  0  1  2  8
                             8  0  9  0 11  2  4
                                0 11  0  3  0  4                      (5)

and hence

    $x^8 + x^6 + 10x^4 + 10x^3 + 8x^2 + 2x + 8$
        $= (9x^2 + 7)(3x^6 + 5x^4 + 9x^2 + 4x + 8) + (11x^4 + 3x^2 + 4)$.

Similarly,

    $3x^6 + 5x^4 + 9x^2 + 4x + 8 = (5x^2 + 5)(11x^4 + 3x^2 + 4) + (4x + 1)$;
    $11x^4 + 3x^2 + 4 = (6x^3 + 5x^2 + 6x + 5)(4x + 1) + 12$;
    $4x + 1 = (9x + 12) \cdot 12 + 0$.                                  (6)

(The equality sign here means congruence modulo 13, since all arithmetic on 
the coefficients has been done mod 13.) This computation shows that 12 is 
a greatest common divisor of the two original polynomials. Now any nonzero 
element of a field is a unit of the domain of polynomials over that field, so it 
is conventional in the case of fields to divide the result of the algorithm by its 
leading coefficient, producing a monic polynomial that is called the greatest 
common divisor of the two given polynomials. The gcd computed in (6) is 
accordingly taken to be 1, not 12. The last step in (6) could have been omitted,
for if deg(v) = 0, then gcd(u(x), v(x)) = 1, no matter what polynomial is chosen
for u(x). Exercise 4 determines the average running time for Euclid’s algorithm
on random polynomials modulo p. 
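
The computation in (6) can be reproduced with a short program. This is a sketch under stated assumptions rather than the book's own code: polynomials are lists of integers with the highest-degree coefficient first, the empty list stands for the zero polynomial, p is assumed prime, and the function names are hypothetical. The modular inverse uses the three-argument `pow` with exponent -1, which needs Python 3.8 or later.

```python
def reduce_mod(u, p):
    """Reduce coefficients mod p and drop leading zeros ([] is the zero polynomial)."""
    u = [a % p for a in u]
    while u and u[0] == 0:
        u.pop(0)
    return u

def poly_divmod_mod(u, v, p):
    """Division with remainder over the integers mod p (p prime); returns (q, r)."""
    u, v = reduce_mod(u, p), reduce_mod(v, p)
    inv = pow(v[0], -1, p)            # inverse of the leading coefficient of v
    q = []
    while len(u) >= len(v):
        f = u[0] * inv % p            # next quotient coefficient
        q.append(f)
        for i in range(len(v)):
            u[i] = (u[i] - f * v[i]) % p
        u.pop(0)                      # the leading coefficient is now zero
    return q, reduce_mod(u, p)

def poly_gcd_mod(u, v, p):
    """Euclid's algorithm mod p; returns the monic gcd (inputs not both zero)."""
    u, v = reduce_mod(u, p), reduce_mod(v, p)
    while v:
        _, r = poly_divmod_mod(u, v, p)
        u, v = v, r
    inv = pow(u[0], -1, p)
    return [a * inv % p for a in u]

u = [1, 0, 1, 0, 10, 10, 8, 2, 8]
v = [3, 0, 5, 0, 9, 4, 8]
print(poly_divmod_mod(u, v, 13))      # quotient 9x^2 + 7, remainder 11x^4 + 3x^2 + 4
print(poly_gcd_mod(u, v, 13))         # the monic gcd, here the constant 1
```

Dividing the final result by its leading coefficient implements the convention that the gcd over a field is taken to be monic.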

Let us now turn to the more general situation in which our polynomials are 
given over a unique factorization domain that is not a field. From Eqs. (4) we 
can deduce the important relations 

    cont(gcd(u, v)) = a · gcd(cont(u), cont(v)),
    pp(gcd(u(x), v(x))) = b · gcd(pp(u(x)), pp(v(x))),                (7)

where a and b are units. Here gcd(u(x), v(x)) denotes any particular polynomial
in x that is a greatest common divisor of u(x) and v(x). Equations (7) reduce 
the problem of finding greatest common divisors of arbitrary polynomials to the 
problem of finding greatest common divisors of primitive polynomials. 
Algorithm D for division of polynomials over a field can be generalized to a
pseudo-division of polynomials over any algebraic system that is a commutative
ring with identity. We can observe that Algorithm D requires explicit division
only by $\ell(v)$, the leading coefficient of v(x), and that step D2 is carried out exactly
m - n + 1 times; thus if u(x) and v(x) start with integer coefficients, and if we
are working over the rational numbers, then the only denominators that appear
in the coefficients of q(x) and r(x) are divisors of $\ell(v)^{m-n+1}$. This suggests that
we can always find polynomials q(x) and r(x) such that

    $\ell(v)^{m-n+1} u(x) = q(x)v(x) + r(x)$,    deg(r) < n,            (8)

where m = deg(u) and n = deg(v), for any polynomials u(x) and v(x) ≠ 0,
provided that m ≥ n.

Algorithm R (Pseudo-division of polynomials). Given polynomials

    $u(x) = u_m x^m + \cdots + u_1 x + u_0$,    $v(x) = v_n x^n + \cdots + v_1 x + v_0$,

where $v_n \ne 0$ and m ≥ n ≥ 0, this algorithm finds polynomials
$q(x) = q_{m-n} x^{m-n} + \cdots + q_0$ and $r(x) = r_{n-1} x^{n-1} + \cdots + r_0$ satisfying (8).

R1. [Iterate on k.] Do step R2 for k = m - n, m - n - 1, ..., 0; then the
algorithm terminates with $(r_{n-1}, \ldots, r_0) = (u_{n-1}, \ldots, u_0)$.

R2. [Multiplication loop.] Set $q_k \leftarrow u_{n+k} v_n^k$, and set $u_j \leftarrow v_n u_j - u_{n+k} v_{j-k}$ for
j = n + k - 1, n + k - 2, ..., 0. (When j < k this means that $u_j \leftarrow v_n u_j$,
since we treat $v_{-1}$, $v_{-2}$, ... as zero. These multiplications could have been
avoided if we had started the algorithm by replacing $u_t$ by $v_n^{m-n-t} u_t$, for
0 ≤ t < m - n.) |

An example calculation appears below in (10). It is easy to prove the validity
of Algorithm R by induction on m - n, since each execution of step R2 essentially
replaces u(x) by $\ell(v)u(x) - \ell(u)x^k v(x)$, where k = deg(u) - deg(v). Note that
no division whatever is used in this algorithm; the coefficients of q(x) and r(x)
are themselves certain polynomial functions of the coefficients of u(x) and v(x).
If $v_n = 1$, the algorithm is identical to Algorithm D. If u(x) and v(x) are
polynomials over a unique factorization domain, we can prove as before that the
polynomials q(x) and r(x) are unique; therefore another way to do the
pseudo-division over a unique factorization domain is to multiply u(x) by $\ell(v)^{m-n+1}$ and
apply Algorithm D, knowing that all the quotients in step D2 will exist.
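
Steps R1 and R2 translate almost line for line into code. The sketch below is illustrative only: the function name `pseudo_divmod` and the list representation (highest-degree coefficient first on input and output) are assumptions, and the arrays are reversed internally so that index j holds the coefficient of $x^j$, as in the algorithm's notation.

```python
def pseudo_divmod(u, v):
    """Pseudo-division following steps R1 and R2.

    u, v: integer coefficient lists, highest degree first, with v[0] != 0
    and deg(u) >= deg(v) >= 0.  Returns (q, r) with
    lead(v)**(m - n + 1) * u = q * v + r and deg(r) < n.
    """
    u = list(reversed(u))             # u[j] is now the coefficient of x^j
    v = list(reversed(v))
    m, n = len(u) - 1, len(v) - 1
    if v[n] == 0 or not (m >= n >= 0):
        raise ValueError("need a nonzero leading coefficient and m >= n >= 0")
    q = [0] * (m - n + 1)
    for k in range(m - n, -1, -1):                    # step R1
        q[k] = u[n + k] * v[n] ** k                   # step R2
        for j in range(n + k - 1, -1, -1):
            v_jk = v[j - k] if j >= k else 0          # v_{-1}, v_{-2}, ... are zero
            u[j] = v[n] * u[j] - u[n + k] * v_jk
    r = u[:n][::-1]                                   # (r_{n-1}, ..., r_0)
    while r and r[0] == 0:                            # drop leading zeros
        r.pop(0)
    return q[::-1], r

u = [1, 0, 1, 0, -3, -3, 8, 2, -5]
v = [3, 0, 5, 0, -4, -9, 21]
q, r = pseudo_divmod(u, v)
print(q, r)
```

No division occurs anywhere in the function, which is the point of the algorithm; the price is that the remainder is scaled by a power of the leading coefficient of v(x).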

Algorithm R can be extended to a “generalized Euclidean algorithm” for
primitive polynomials over a unique factorization domain, in the following way:
Let u(x) and v(x) be primitive polynomials with deg(u) ≥ deg(v), and determine
the polynomial r(x) satisfying (8) by means of Algorithm R. Now we can prove
that gcd(u(x), v(x)) = gcd(v(x), r(x)): Any common divisor of u(x) and v(x)
divides v(x) and r(x); conversely, any common divisor of v(x) and r(x) divides
$\ell(v)^{m-n+1} u(x)$, and it must be primitive (since v(x) is primitive) so it divides
u(x). If r(x) = 0, we therefore have gcd(u(x), v(x)) = v(x); on the other hand if
r(x) ≠ 0, we have gcd(v(x), r(x)) = gcd(v(x), pp(r(x))) since v(x) is primitive,
so the process can be iterated.
Algorithm E (Generalized Euclidean algorithm). Given nonzero polynomials
u(x) and v(x) over a unique factorization domain S, this algorithm calculates a
greatest common divisor of u(x) and v(x). We assume that auxiliary algorithms
exist to calculate greatest common divisors of elements of S, and to divide a
by b in S when b ≠ 0 and a is a multiple of b.

E1. [Reduce to primitive.] Set d ← gcd(cont(u), cont(v)), using the assumed
algorithm for calculating greatest common divisors in S. (Note that cont(u)
is a greatest common divisor of the coefficients of u(x).) Replace u(x) by the
polynomial u(x)/cont(u) = pp(u(x)); similarly, replace v(x) by pp(v(x)).

E2. [Pseudo-division.] Calculate r(x) using Algorithm R. (It is unnecessary to
calculate the quotient polynomial q(x).) If r(x) = 0, go to E4. If deg(r) = 0,
replace v(x) by the constant polynomial “1” and go to E4.

E3. [Make remainder primitive.] Replace u(x) by v(x) and replace v(x) by
pp(r(x)). Go back to step E2. (This is the “Euclidean step,” analogous
to the other instances of Euclid’s algorithm that we have seen.)

E4. [Attach the content.] The algorithm terminates, with d · v(x) as the desired
answer. |

As an example of Algorithm E, let us calculate the gcd of the polynomials

    $u(x) = x^8 + x^6 - 3x^4 - 3x^3 + 8x^2 + 2x - 5$,
    $v(x) = 3x^6 + 5x^4 - 4x^2 - 9x + 21$,                              (9)

over the integers. These polynomials are primitive, so step E1 sets d ← 1. In
step E2 we have the pseudo-division

                                                        1    0   -6
    3 0 5 0 -4 -9 21 )    1    0    1    0   -3   -3    8    2   -5
                          3    0    3    0   -9   -9   24    6  -15
                          3    0    5    0   -4   -9   21
                               0   -2    0   -5    0    3    6  -15
                               0   -6    0  -15    0    9   18  -45
                               0    0    0    0    0    0    0
                                   -6    0  -15    0    9   18  -45
                                  -18    0  -45    0   27   54 -135
                                  -18    0  -30    0   24   54 -126
                                            -15    0    3    0   -9   (10)

Here the quotient q(x) is $1 \cdot 3^2 x^2 + 0 \cdot 3^1 x + (-6) \cdot 3^0$; we have

    $27 u(x) = v(x)(9x^2 - 6) + (-15x^4 + 3x^2 - 9)$.                   (11)

Now step E3 replaces u(x) by v(x) and v(x) by pp(r(x)) = $5x^4 - x^2 + 3$. The
subsequent calculation is summarized in the following table, where only the
coefficients are shown:

    u(x)                        v(x)                   r(x)
    1,0,1,0,-3,-3,8,2,-5        3,0,5,0,-4,-9,21       -15,0,3,0,-9
    3,0,5,0,-4,-9,21            5,0,-1,0,3             -585,-1125,2205
    5,0,-1,0,3                  13,25,-49              -233150,307500
    13,25,-49                   4663,-6150             143193869      (12)
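
Table (12) can be regenerated by a direct transcription of steps E1 through E4 for polynomials over the integers. What follows is a sketch, assuming nonzero inputs given as coefficient lists (highest degree first) with deg(u) ≥ deg(v); the names `pseudo_rem`, `primitive_part`, and `gcd_algorithm_e` are illustrative, and the returned list of remainders is an addition made here purely for inspection.

```python
from functools import reduce
from math import gcd

def content(u):
    """Nonnegative gcd of the coefficients of u."""
    return reduce(gcd, (abs(a) for a in u))

def primitive_part(u):
    """Divide out the content; the leading coefficient is made positive."""
    c = content(u)
    if u[0] < 0:
        c = -c
    return [a // c for a in u]

def pseudo_rem(u, v):
    """Remainder r(x) of Algorithm R (performs deg(u) - deg(v) + 1 reduction steps)."""
    u = list(u)
    while len(u) >= len(v):
        lead = u[0]
        u = [v[0] * a for a in u]          # multiply u(x) by the leading coefficient of v
        for i in range(len(v)):
            u[i] -= lead * v[i]            # subtract lead * x^k * v(x)
        u.pop(0)                           # the leading term has cancelled
    while u and u[0] == 0:
        u.pop(0)
    return u

def gcd_algorithm_e(u, v):
    """Generalized Euclidean algorithm over the integers; returns (gcd, remainders)."""
    d = gcd(content(u), content(v))                    # E1
    u, v = primitive_part(u), primitive_part(v)
    remainders = []
    while True:
        r = pseudo_rem(u, v)                           # E2
        remainders.append(r)
        if not r:
            break
        if len(r) == 1:
            v = [1]
            break
        u, v = v, primitive_part(r)                    # E3
    return [d * a for a in v], remainders              # E4

g, rs = gcd_algorithm_e([1, 0, 1, 0, -3, -3, 8, 2, -5], [3, 0, 5, 0, -4, -9, 21])
print(g)
for r in rs:
    print(r)
```

The printed remainders are the r(x) column of (12), and the gcd is the constant polynomial 1.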


It is instructive to compare this calculation with the computation of the 
same greatest common divisor over the rational numbers, instead of over the 
integers, by using Euclid’s algorithm for polynomials over a field as described 
earlier in this section. The following surprisingly complicated sequence appears: 

    u(x)                             v(x)
    1,0,1,0,-3,-3,8,2,-5             3,0,5,0,-4,-9,21
    3,0,5,0,-4,-9,21                 -5/9, 0, 1/9, 0, -1/3
    -5/9, 0, 1/9, 0, -1/3            -117/25, -9, 441/25
    -117/25, -9, 441/25              233150/19773, -102500/6591
    233150/19773, -102500/6591       -1288744821/543589225            (13)


To improve that algorithm, we can reduce u(x) and v(x) to monic polynomials
at each step, since this removes “unit” factors that make the coefficients
more complicated than necessary; this is actually Algorithm E over the rationals:

    u(x)                             v(x)
    1,0,1,0,-3,-3,8,2,-5             1, 0, 5/3, 0, -4/3, -3, 7
    1, 0, 5/3, 0, -4/3, -3, 7        1, 0, -1/5, 0, 3/5
    1, 0, -1/5, 0, 3/5               1, 25/13, -49/13
    1, 25/13, -49/13                 1, -6150/4663
    1, -6150/4663                    1                                (14)


In both (13) and (14) the sequence of polynomials is essentially the same as 
(12), which was obtained by Algorithm E over the integers; the only difference is 
that the polynomials have been multiplied by certain rational numbers. Whether 
we have $5x^4 - x^2 + 3$ or $-\frac59 x^4 + \frac19 x^2 - \frac13$ or $x^4 - \frac15 x^2 + \frac35$, the computations
are essentially the same. But either algorithm using rational arithmetic will 
run noticeably slower than the all-integer Algorithm E, since rational arithmetic 
requires many more evaluations of gcd’s of integers within each step. Therefore 
it is definitely better to use the all-integer algorithm. 

It is also instructive to compare the above calculations with (6) above, where 
we determined the gcd of the same polynomials u(x) and v(x) modulo 13 with 
considerably less labor. Since $\ell(u)$ and $\ell(v)$ are not multiples of 13, the fact
that gcd(u(x), v(x)) = 1 modulo 13 is sufficient to prove that u(x) and v(x) are
relatively prime over the integers (and therefore over the rational numbers); we 
will return to this time-saving observation at the close of Section 4.6.2. 
The subresultant algorithm. An ingenious algorithm that is generally superior
to Algorithm E, and that gives us further information about Algorithm E’s
behavior, was discovered by George E. Collins [JACM 14 (1967), 128-142] and
subsequently improved by W. S. Brown and J. F. Traub [JACM 18 (1971), 505-514;
see also W. S. Brown, ACM Trans. Math. Software 4 (1978), 237-249]. This
algorithm avoids the calculation of primitive parts in step E3, dividing instead
by an element of S that is known to be a factor of r(x):

Algorithm C (Greatest common divisor over a unique factorization domain). This
algorithm has the same input and output assumptions as Algorithm E, and has
the advantage that fewer calculations of greatest common divisors of coefficients
are needed.

C1. [Reduce to primitive.] As in step E1 of Algorithm E, set d ← gcd(cont(u),
cont(v)), and replace (u(x), v(x)) by (pp(u(x)), pp(v(x))). Set g ← h ← 1.

C2. [Pseudo-division.] Set $\delta$ ← deg(u) - deg(v). Calculate r(x) using
Algorithm R. If r(x) = 0, go to C4. If deg(r) = 0, replace v(x) by the constant
polynomial “1” and go to C4.

C3. [Adjust remainder.] Replace the polynomial u(x) by v(x), and replace v(x)
by $r(x)/gh^\delta$. (At this point all coefficients of r(x) are multiples of $gh^\delta$.)
Then set g ← $\ell(u)$, h ← $h^{1-\delta} g^\delta$ and return to C2. (The new value of h will
be in the domain S, even if $\delta$ > 1.)

C4. [Attach the content.] Return d · pp(v(x)) as the answer. |

If we apply this algorithm to the polynomials (9) considered earlier, the
following sequence of results is obtained at the beginning of step C2:

    u(x)                        v(x)                   g      h
    1,0,1,0,-3,-3,8,2,-5        3,0,5,0,-4,-9,21       1      1
    3,0,5,0,-4,-9,21            -15,0,3,0,-9           3      9
    -15,0,3,0,-9                65,125,-245            -15    25
    65,125,-245                 -9326,12300            65     169     (15)

At the conclusion of the algorithm, $r(x)/gh^\delta$ = 260708.

The sequence of polynomials consists of integral multiples of the polynomials
in the sequence produced by Algorithm E. In spite of the fact that the
polynomials are not reduced to primitive form, the coefficients are kept to a
reasonable size because of the reduction factor in step C3.
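
The rows of (15) can be reproduced with a compact implementation of steps C1 through C4 over the integers. This is a sketch under assumptions, not a definitive implementation: inputs are nonzero coefficient lists (highest degree first) with deg(u) ≥ deg(v), the helper names are hypothetical, and the trace of (v, g, h) values is returned only so the table can be checked.

```python
from functools import reduce
from math import gcd

def content(u):
    return reduce(gcd, (abs(a) for a in u))

def primitive_part(u):
    c = content(u)
    if u[0] < 0:
        c = -c
    return [a // c for a in u]

def pseudo_rem(u, v):
    """Remainder r(x) of Algorithm R."""
    u = list(u)
    while len(u) >= len(v):
        lead = u[0]
        u = [v[0] * a for a in u]
        for i in range(len(v)):
            u[i] -= lead * v[i]
        u.pop(0)
    while u and u[0] == 0:
        u.pop(0)
    return u

def gcd_algorithm_c(u, v):
    """Subresultant gcd over the integers; returns (gcd, trace of (v, g, h))."""
    d = gcd(content(u), content(v))                    # C1
    u, v = primitive_part(u), primitive_part(v)
    g = h = 1
    trace = []
    while True:
        delta = len(u) - len(v)                        # C2
        r = pseudo_rem(u, v)
        if not r:
            break
        if len(r) == 1:
            v = [1]
            break
        divisor = g * h ** delta                       # C3: exact division of r(x)
        u, v = v, [a // divisor for a in r]
        g = u[0]
        if delta > 0:
            h = g ** delta // h ** (delta - 1)         # h^(1-delta) * g^delta, exact
        trace.append((v, g, h))
    return [d * a for a in primitive_part(v)], trace   # C4

result, trace = gcd_algorithm_c([1, 0, 1, 0, -3, -3, 8, 2, -5], [3, 0, 5, 0, -4, -9, 21])
print(result)
for row in trace:
    print(row)
```

Each division in step C3 is written with `//` because the algorithm guarantees exactness; a cautious caller could assert that the remainder of each such division is zero.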

In order to analyze Algorithm C and to prove that it is valid, let us call the
sequence of polynomials it produces $u_1(x)$, $u_2(x)$, $u_3(x)$, ..., where $u_1(x) = u(x)$
and $u_2(x) = v(x)$. Let $\delta_j = n_j - n_{j+1}$ for j ≥ 1, where $n_j = \deg(u_j)$; and let
$g_1 = h_1 = 1$, $g_j = \ell(u_j)$, $h_j = h_{j-1}^{1-\delta_{j-1}} g_j^{\delta_{j-1}}$ for j ≥ 2. Then we have

    $g_2^{\delta_1+1} u_1(x) = u_2(x) q_1(x) + g_1 h_1^{\delta_1} u_3(x)$,    $n_3 < n_2$;
    $g_3^{\delta_2+1} u_2(x) = u_3(x) q_2(x) + g_2 h_2^{\delta_2} u_4(x)$,    $n_4 < n_3$;
    $g_4^{\delta_3+1} u_3(x) = u_4(x) q_3(x) + g_3 h_3^{\delta_3} u_5(x)$,    $n_5 < n_4$;      (16)
Table 1

COEFFICIENTS IN ALGORITHM C

    Row                                                Multiply                 Replace
    name  Row                                          by                       by row
    A5    a8 a7 a6 a5 a4 a3 a2 a1 a0 0  0  0  0  0     b6^3                     C5
    A4    0  a8 a7 a6 a5 a4 a3 a2 a1 a0 0  0  0  0     b6^3                     C4
    A3    0  0  a8 a7 a6 a5 a4 a3 a2 a1 a0 0  0  0     b6^3                     C3
    A2    0  0  0  a8 a7 a6 a5 a4 a3 a2 a1 a0 0  0     b6^3                     C2
    A1    0  0  0  0  a8 a7 a6 a5 a4 a3 a2 a1 a0 0     b6^3                     C1
    A0    0  0  0  0  0  a8 a7 a6 a5 a4 a3 a2 a1 a0    b6^3                     C0
    B7    b6 b5 b4 b3 b2 b1 b0 0  0  0  0  0  0  0
    B6    0  b6 b5 b4 b3 b2 b1 b0 0  0  0  0  0  0
    B5    0  0  b6 b5 b4 b3 b2 b1 b0 0  0  0  0  0
    B4    0  0  0  b6 b5 b4 b3 b2 b1 b0 0  0  0  0
    B3    0  0  0  0  b6 b5 b4 b3 b2 b1 b0 0  0  0     c4^3/b6^5                D3
    B2    0  0  0  0  0  b6 b5 b4 b3 b2 b1 b0 0  0     c4^3/b6^5                D2
    B1    0  0  0  0  0  0  b6 b5 b4 b3 b2 b1 b0 0     c4^3/b6^5                D1
    B0    0  0  0  0  0  0  0  b6 b5 b4 b3 b2 b1 b0    c4^3/b6^5                D0
    C5    0  0  0  0  c4 c3 c2 c1 c0 0  0  0  0  0
    C4    0  0  0  0  0  c4 c3 c2 c1 c0 0  0  0  0
    C3    0  0  0  0  0  0  c4 c3 c2 c1 c0 0  0  0
    C2    0  0  0  0  0  0  0  c4 c3 c2 c1 c0 0  0
    C1    0  0  0  0  0  0  0  0  c4 c3 c2 c1 c0 0     d2^3 b6^4/c4^5           E1
    C0    0  0  0  0  0  0  0  0  0  c4 c3 c2 c1 c0    d2^3 b6^4/c4^5           E0
    D3    0  0  0  0  0  0  0  0  d2 d1 d0 0  0  0
    D2    0  0  0  0  0  0  0  0  0  d2 d1 d0 0  0
    D1    0  0  0  0  0  0  0  0  0  0  d2 d1 d0 0
    D0    0  0  0  0  0  0  0  0  0  0  0  d2 d1 d0    e1^2 c4^2/(d2^3 b6^2)    F0
    E1    0  0  0  0  0  0  0  0  0  0  0  e1 e0 0
    E0    0  0  0  0  0  0  0  0  0  0  0  0  e1 e0
    F0    0  0  0  0  0  0  0  0  0  0  0  0  0  f0




and so on. The process terminates when $n_{k+1} = \deg(u_{k+1}) \le 0$. We must show
that $u_3(x)$, $u_4(x)$, ..., have coefficients in S, i.e., that the factors $g_j h_j^{\delta_j}$ evenly
divide the remainders, and we must also show that the $h_j$ values all belong to S.
The proof is rather involved, and it can be most easily understood by considering 
an example. 

Suppose, as in (15), that $n_1 = 8$, $n_2 = 6$, $n_3 = 4$, $n_4 = 2$, $n_5 = 1$, $n_6 = 0$,
so that $\delta_1 = \delta_2 = \delta_3 = 2$, $\delta_4 = \delta_5 = 1$. Let us write
$u_1(x) = a_8 x^8 + a_7 x^7 + \cdots + a_0$, $u_2(x) = b_6 x^6 + b_5 x^5 + \cdots + b_0$, ...,
$u_5(x) = e_1 x + e_0$, $u_6(x) = f_0$, so that $h_1 = 1$, $h_2 = b_6^2$, $h_3 = c_4^2/b_6^2$,
$h_4 = d_2^2 b_6^2/c_4^2$. In these terms it is helpful to consider the array shown in
Table 1. For concreteness, let us assume that the coefficients of the polynomials
are integers. We have $b_6^3 u_1(x) = u_2(x) q_1(x) + u_3(x)$; so if we multiply row $A_5$
by $b_6^3$ and subtract appropriate multiples of rows $B_7$, $B_6$, and $B_5$ (corresponding
to the coefficients of $q_1(x)$) we will get row $C_5$. If we also multiply row $A_4$ by
$b_6^3$ and subtract multiples of rows $B_6$, $B_5$, and $B_4$, we get row $C_4$. In a similar
way, we have $c_4^3 u_2(x) = u_3(x) q_2(x) + b_6^5 u_4(x)$; so we can multiply row $B_3$ by
$c_4^3$, subtract integer multiples of rows $C_5$, $C_4$, and $C_3$, then divide by $b_6^5$ to
obtain row $D_3$.

In order to prove that $u_4(x)$ has integer coefficients, let us consider the matrix

    A2    a8 a7 a6 a5 a4 a3 a2 a1 a0 0  0
    A1    0  a8 a7 a6 a5 a4 a3 a2 a1 a0 0
    A0    0  0  a8 a7 a6 a5 a4 a3 a2 a1 a0
    B4    b6 b5 b4 b3 b2 b1 b0 0  0  0  0
    B3    0  b6 b5 b4 b3 b2 b1 b0 0  0  0
    B2    0  0  b6 b5 b4 b3 b2 b1 b0 0  0
    B1    0  0  0  b6 b5 b4 b3 b2 b1 b0 0
    B0    0  0  0  0  b6 b5 b4 b3 b2 b1 b0     = M.                   (17)

The indicated row operations and a permutation of rows will transform M into

    B4    b6 b5 b4 b3 b2 b1 b0 0  0  0  0
    B3    0  b6 b5 b4 b3 b2 b1 b0 0  0  0
    B2    0  0  b6 b5 b4 b3 b2 b1 b0 0  0
    B1    0  0  0  b6 b5 b4 b3 b2 b1 b0 0
    C2    0  0  0  0  c4 c3 c2 c1 c0 0  0
    C1    0  0  0  0  0  c4 c3 c2 c1 c0 0
    C0    0  0  0  0  0  0  c4 c3 c2 c1 c0
    D0    0  0  0  0  0  0  0  0  d2 d1 d0     = M'.                  (18)

Because of the way M' has been derived from M, we must have

    $b_6^3 \cdot b_6^3 \cdot b_6^3 \cdot (c_4^3/b_6^5) \cdot \det M_0 = \pm \det M_0'$,

if $M_0$ and $M_0'$ represent any square matrices obtained by selecting eight
corresponding columns from M and M'. For example, let us select the first seven
columns and the column containing $d_1$; then

                                             | a8 a7 a6 a5 a4 a3 a2 0  |
                                             | 0  a8 a7 a6 a5 a4 a3 a0 |
                                             | 0  0  a8 a7 a6 a5 a4 a1 |
    $b_6^3 \cdot b_6^3 \cdot b_6^3 \cdot (c_4^3/b_6^5)$ · det | b6 b5 b4 b3 b2 b1 b0 0  |  $= \pm b_6^4 c_4^3 d_1$.
                                             | 0  b6 b5 b4 b3 b2 b1 0  |
                                             | 0  0  b6 b5 b4 b3 b2 0  |
                                             | 0  0  0  b6 b5 b4 b3 b0 |
                                             | 0  0  0  0  b6 b5 b4 b1 |

Since $b_6 c_4 \ne 0$, this proves that $d_1$ is an integer. Similarly, $d_2$ and $d_0$ are integers.

In general, we can show that $u_{j+1}(x)$ has integer coefficients in a similar
manner. If we start with the matrix M consisting of rows $A_{n_2-n_j}$ through
$A_0$ and $B_{n_1-n_j}$ through $B_0$, and if we perform the row operations indicated in
Table 1, we will obtain a matrix M' consisting in some order of rows $B_{n_1-n_j}$
through $B_{n_3-n_j+1}$, $C_{n_2-n_j}$ through $C_{n_4-n_j+1}$, ..., $P_{n_{j-2}-n_j}$ through $P_1$,
$Q_{n_{j-1}-n_j}$ through $Q_0$, and finally $R_0$ (a row containing the coefficients of
$u_{j+1}(x)$). Extracting appropriate columns shows that

    $(g_2^{\delta_1+1}/g_1 h_1^{\delta_1})^{n_2-n_j+1} (g_3^{\delta_2+1}/g_2 h_2^{\delta_2})^{n_3-n_j+1} \cdots (g_j^{\delta_{j-1}+1}/g_{j-1} h_{j-1}^{\delta_{j-1}})^{n_j-n_j+1} \det M_0$
        $= \pm g_2^{n_1-n_3} g_3^{n_2-n_4} \cdots g_{j-1}^{n_{j-2}-n_j} g_j^{n_{j-1}-n_j+1} r_t$,      (19)

where $r_t$ is a given coefficient of $u_{j+1}(x)$ and $M_0$ is a submatrix of M. The h’s
have been chosen very cleverly so that this equation simplifies to

    $\det M_0 = \pm r_t$                                                (20)

(see exercise 24). Therefore every coefficient of $u_{j+1}(x)$ can be expressed as the
determinant of an $(n_1 + n_2 - 2n_j + 2) \times (n_1 + n_2 - 2n_j + 2)$ matrix whose
elements are coefficients of u(x) and v(x).

It remains to be shown that the cleverly chosen h’s also are integers. A
similar technique applies: Let’s look, for example, at the matrix

    A1    a8 a7 a6 a5 a4 a3 a2 a1 a0 0
    A0    0  a8 a7 a6 a5 a4 a3 a2 a1 a0
    B3    b6 b5 b4 b3 b2 b1 b0 0  0  0
    B2    0  b6 b5 b4 b3 b2 b1 b0 0  0
    B1    0  0  b6 b5 b4 b3 b2 b1 b0 0
    B0    0  0  0  b6 b5 b4 b3 b2 b1 b0     = M.                      (21)

Row operations as specified in Table 1, and permutation of rows, lead to

    B3    b6 b5 b4 b3 b2 b1 b0 0  0  0
    B2    0  b6 b5 b4 b3 b2 b1 b0 0  0
    B1    0  0  b6 b5 b4 b3 b2 b1 b0 0
    B0    0  0  0  b6 b5 b4 b3 b2 b1 b0
    C1    0  0  0  0  c4 c3 c2 c1 c0 0
    C0    0  0  0  0  0  c4 c3 c2 c1 c0     = M';                     (22)

hence if we consider any submatrices $M_0$ and $M_0'$ obtained by selecting six
corresponding columns of M and M' we have

    $b_6^3 \cdot b_6^3 \cdot \det M_0 = \pm \det M_0'$.

When $M_0$ is chosen to be the first six columns of M, we find that
$\det M_0 = \pm c_4^2/b_6^2 = \pm h_3$, so $h_3$ is an integer.

In general, to show that $h_j$ is an integer for j ≥ 3, we start with the matrix
M consisting of rows $A_{n_2-n_j-1}$ through $A_0$ and $B_{n_1-n_j-1}$ through $B_0$; then
we perform appropriate row operations until obtaining a matrix M' consisting of
rows $B_{n_1-n_j-1}$ through $B_{n_3-n_j}$, $C_{n_2-n_j-1}$ through $C_{n_4-n_j}$, ...,
$P_{n_{j-2}-n_j-1}$ through $P_0$, $Q_{n_{j-1}-n_j-1}$ through $Q_0$. Letting $M_0$ be the first
$n_1 + n_2 - 2n_j$ columns of M, we obtain

    $(g_2^{\delta_1+1}/g_1 h_1^{\delta_1})^{n_2-n_j} (g_3^{\delta_2+1}/g_2 h_2^{\delta_2})^{n_3-n_j} \cdots (g_j^{\delta_{j-1}+1}/g_{j-1} h_{j-1}^{\delta_{j-1}})^{n_j-n_j} \det M_0$
        $= \pm g_2^{n_1-n_3} g_3^{n_2-n_4} \cdots g_{j-1}^{n_{j-2}-n_j} g_j^{n_{j-1}-n_j}$,            (23)

an equation that neatly simplifies to

    $\det M_0 = \pm h_j$.                                               (24)

(This proof, although stated for the domain of integers, obviously applies to any
unique factorization domain.)

In the process of verifying Algorithm C, we have also learned that every
element of S dealt with by the algorithm can be expressed as a determinant whose
entries are the coefficients of the primitive parts of the original polynomials. A
well-known theorem of Hadamard (see exercise 15) states that

    $|\det(a_{ij})| \le \prod_{1 \le i \le n} \Bigl( \sum_{1 \le j \le n} a_{ij}^2 \Bigr)^{1/2}$;                   (25)

therefore an upper bound for the maximum coefficient appearing in the
polynomials computed by Algorithm C is

    $N^{m+n} (m + 1)^{n/2} (n + 1)^{m/2}$,                                (26)

if all coefficients of the given polynomials u(x) and v(x) are bounded by N
in absolute value. This same upper bound applies to the coefficients of all
polynomials u(x) and v(x) computed during the execution of Algorithm E, since
the polynomials obtained in Algorithm E are always divisors of the polynomials
obtained in Algorithm C.

This upper bound on the coefficients is extremely gratifying, because it is
much better than we would ordinarily have a right to expect. For example,
consider what happens if we avoid the corrections in steps E3 and C3, merely
replacing v(x) by r(x). This is the simplest gcd algorithm, and it is the one
that traditionally appears in textbooks on algebra (for theoretical purposes, not
intended for practical calculations). If we suppose that $\delta_1 = \delta_2 = \cdots = 1$, we
find that the coefficients of $u_3(x)$ are bounded by $N^3$, the coefficients of $u_4(x)$
are bounded by $N^7$, those of $u_5(x)$ by $N^{17}$, ...; the coefficients of $u_k(x)$ are
bounded by $N^{a_k}$, where $a_k = 2a_{k-1} + a_{k-2}$. Thus the upper bound, in place
of (26) for m = n + 1, would be approximately

    $N^{0.5(2.414)^n}$,                                                 (27)

and experiments show that the simple algorithm does in fact have this behavior;
the number of digits in the coefficients grows exponentially at each step! In
Algorithm E, by contrast, the growth in the number of digits is only slightly
more than linear at most.
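
The growth is easy to observe experimentally. The sketch below is an illustration, not the book's code: it runs the uncorrected pseudo-remainder sequence (v(x) is simply replaced by r(x)) on the example polynomials (9), which have some degree gaps of 2 rather than the all-ones case analyzed above, and reports the number of decimal digits in the largest coefficient of each remainder. The names `pseudo_rem`, `naive_prs`, and `max_digits` are hypothetical.

```python
def pseudo_rem(u, v):
    """Remainder r(x) of Algorithm R; coefficient lists, highest degree first."""
    u = list(u)
    while len(u) >= len(v):
        lead = u[0]
        u = [v[0] * a for a in u]
        for i in range(len(v)):
            u[i] -= lead * v[i]
        u.pop(0)
    while u and u[0] == 0:
        u.pop(0)
    return u

def naive_prs(u, v):
    """Pseudo-remainder sequence with no correction: v(x) is replaced by r(x) as is."""
    remainders = []
    while len(v) > 1:                  # stop once v(x) is a constant
        r = pseudo_rem(u, v)
        if not r:
            break
        remainders.append(r)
        u, v = v, r
    return remainders

def max_digits(poly):
    """Decimal digits in the largest coefficient (by absolute value)."""
    return max(len(str(abs(a))) for a in poly)

u = [1, 0, 1, 0, -3, -3, 8, 2, -5]
v = [3, 0, 5, 0, -4, -9, 21]
for r in naive_prs(u, v):
    print(len(r) - 1, max_digits(r))   # degree of the remainder, digits in its largest coefficient
```

For comparison, the largest remainder coefficient that Algorithm E produces on the same input is 143193869, a nine-digit number.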

Another byproduct of our proof of Algorithm C is the fact that the degrees of
the polynomials will almost always decrease by 1 at each step, so that the number
of iterations of step C2 (or E2) will usually be deg(v) if the given polynomials
are “random.” In order to see why this happens, note for example that we could
have chosen the first eight columns of M and M' in (17) and (18), and then we
would have found that $u_4(x)$ has degree less than 3 if and only if $d_3 = 0$, that
is, if and only if

        | a8 a7 a6 a5 a4 a3 a2 a1 |
        | 0  a8 a7 a6 a5 a4 a3 a2 |
        | 0  0  a8 a7 a6 a5 a4 a3 |
    det | b6 b5 b4 b3 b2 b1 b0 0  |  = 0.
        | 0  b6 b5 b4 b3 b2 b1 b0 |
        | 0  0  b6 b5 b4 b3 b2 b1 |
        | 0  0  0  b6 b5 b4 b3 b2 |
        | 0  0  0  0  b6 b5 b4 b3 |

In general, $\delta_j$ will be greater than 1 for j > 1 if and only if a similar determinant
in the coefficients of u(x) and v(x) is zero. Since such a determinant is a nonzero
multivariate polynomial in the coefficients, it will be nonzero “almost always,”
or “with probability 1.” (See exercise 16 for a more precise formulation of this
statement, and see exercise 4 for a related proof.) The example polynomials in
(15) have both $\delta_2$ and $\delta_3$ equal to 2, so they are exceptional indeed.

The considerations above can be used to derive the well-known fact that 
two polynomials are relatively prime if and only if their “resultant” is nonzero; 
the resultant is a determinant having the form of rows $A_5$ through $A_0$ and $B_7$
through $B_0$ in Table 1. (This is “Sylvester’s determinant”; see exercise 12.
Further properties of resultants are discussed in B. L. van der Waerden, Modern 
Algebra, tr. by Fred Blum (New York: Ungar, 1949), Sections 27-28.) From 
the standpoint discussed above, we could say that the gcd is “almost always” 
of degree zero, since Sylvester’s determinant is almost never zero. But many 
calculations of practical interest would never be undertaken if there weren’t some 
reasonable chance that the gcd would be a polynomial of positive degree. 
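
Sylvester's determinant can be formed and evaluated directly for small examples. The following sketch makes several assumptions of its own: the function names are hypothetical, polynomials are integer coefficient lists with the highest degree first and a nonzero leading coefficient, and the determinant is computed by ordinary Gaussian elimination over exact fractions, chosen here for simplicity rather than efficiency.

```python
from fractions import Fraction

def sylvester_matrix(u, v):
    """Sylvester matrix: deg(v) shifted rows of u followed by deg(u) shifted rows of v."""
    m, n = len(u) - 1, len(v) - 1
    rows = [[0] * i + list(u) + [0] * (n - 1 - i) for i in range(n)]
    rows += [[0] * i + list(v) + [0] * (m - 1 - i) for i in range(m)]
    return rows

def determinant(matrix):
    """Exact determinant by Gaussian elimination with row exchanges."""
    a = [[Fraction(x) for x in row] for row in matrix]
    size = len(a)
    det = Fraction(1)
    for c in range(size):
        pivot = next((r for r in range(c, size) if a[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            det = -det                       # a row exchange flips the sign
        det *= a[c][c]
        for r in range(c + 1, size):
            f = a[r][c] / a[c][c]
            for k in range(c, size):
                a[r][k] -= f * a[c][k]
    return det

def resultant(u, v):
    return determinant(sylvester_matrix(u, v))

print(resultant([1, 0, -1], [1, -2]))        # x^2 - 1 and x - 2: no common root
print(resultant([1, -3, 2], [1, 2, -3]))     # both are divisible by x - 1
```

For the two polynomials of (9) the matrix has 14 rows, exactly the rows $A_5$ through $A_0$ and $B_7$ through $B_0$ of Table 1, and its determinant is nonzero because those polynomials are relatively prime.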

We can see exactly what happens during Algorithms E and C when the gcd
is not 1 by considering $u(x) = w(x)u_1(x)$ and $v(x) = w(x)u_2(x)$, where $u_1(x)$ and
$u_2(x)$ are relatively prime and w(x) is primitive. Then if the polynomials $u_1(x)$,
$u_2(x)$, $u_3(x)$, ... are obtained when Algorithm E works on $u(x) = u_1(x)$ and
$v(x) = u_2(x)$, it is easy to show that the sequence obtained for $u(x) = w(x)u_1(x)$
and $v(x) = w(x)u_2(x)$ is simply $w(x)u_1(x)$, $w(x)u_2(x)$, $w(x)u_3(x)$, $w(x)u_4(x)$, ....
With Algorithm C the behavior is different; if the polynomials $u_1(x)$, $u_2(x)$, $u_3(x)$,
... are obtained when Algorithm C is applied to $u(x) = u_1(x)$ and $v(x) = u_2(x)$,
and if we assume that $\deg(u_{j+1}) = \deg(u_j) - 1$ (which is almost always true
when j > 1), then the sequence

    $w(x)u_1(x)$, $w(x)u_2(x)$, $\ell^2 w(x)u_3(x)$, $\ell^4 w(x)u_4(x)$, $\ell^6 w(x)u_5(x)$, ...      (28)

is obtained when Algorithm C is applied to $u(x) = w(x)u_1(x)$ and $v(x) =
w(x)u_2(x)$, where $\ell = \ell(w)$. (See exercise 13.) Even though these additional
$\ell$-factors are present, Algorithm C will be superior to Algorithm E, because it is
easier to deal with slightly larger polynomials than to calculate primitive parts
repeatedly.

Polynomial remainder sequences such as those in Algorithms C and E are
not useful merely for finding greatest common divisors; another important
application is to the enumeration of real roots, for a given polynomial in a given
interval, according to the famous theorem of J. Sturm [Mem. presentes par divers
savants 6 (Paris, 1835), 271-318]. Let u(x) be a polynomial over the real
numbers, having distinct roots. We shall see in the next section that this is the same
as saying gcd(u(x), u'(x)) = 1, where u'(x) is the derivative of u(x); accordingly,
there is a polynomial remainder sequence proving that u(x) is relatively prime
to u'(x). We set $u_0(x) = u(x)$, $u_1(x) = u'(x)$, and (following Sturm) we negate
the sign of all remainders:

    $c_1 u_0(x) = u_1(x) q_1(x) - d_1 u_2(x)$,
    $c_2 u_1(x) = u_2(x) q_2(x) - d_2 u_3(x)$,
        ...
    $c_k u_{k-1}(x) = u_k(x) q_k(x) - d_k u_{k+1}(x)$,                      (29)

for some positive constants $c_j$ and $d_j$, where $\deg(u_{k+1}) = 0$. We say that the
variation V(u, a) of u(x) at a is the number of changes of sign in the sequence
$u_0(a)$, $u_1(a)$, ..., $u_{k+1}(a)$, not counting zeros. For example, if the sequence of
signs is 0, +, -, -, 0, +, +, -, we have V(u, a) = 3. Sturm’s theorem asserts
that the number of roots of u(x) in the interval a < x ≤ b is V(u, a) - V(u, b);
and the proof is surprisingly short (see exercise 22).
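
A small root-counting routine shows how (29) is used. This sketch takes the simplest admissible choice $c_j = d_j = 1$, computing exact remainders over the rationals instead of pseudo-remainders; the function names are hypothetical, the input is assumed to be a polynomial of degree at least 1 with rational coefficients (highest degree first) and distinct roots, and a < b.

```python
from fractions import Fraction

def poly_rem(u, v):
    """Remainder of u(x) divided by v(x) over the rationals."""
    u = [Fraction(a) for a in u]
    while len(u) >= len(v):
        f = u[0] / v[0]
        for i in range(len(v)):
            u[i] -= f * v[i]
        u.pop(0)
    while u and u[0] == 0:
        u.pop(0)
    return u

def derivative(u):
    d = len(u) - 1
    return [a * (d - i) for i, a in enumerate(u[:-1])]

def sturm_sequence(u):
    """u_0 = u, u_1 = u', and each later term is the negated remainder."""
    seq = [[Fraction(a) for a in u], [Fraction(a) for a in derivative(u)]]
    while len(seq[-1]) > 1:
        r = poly_rem(seq[-2], seq[-1])
        if not r:
            raise ValueError("u(x) has a repeated root")
        seq.append([-a for a in r])
    return seq

def evaluate(u, x):
    acc = Fraction(0)
    for a in u:                        # Horner's rule
        acc = acc * x + a
    return acc

def variation(seq, a):
    """Number of sign changes in u_0(a), u_1(a), ..., not counting zeros."""
    values = [evaluate(p, a) for p in seq]
    signs = [val > 0 for val in values if val != 0]
    return sum(s != t for s, t in zip(signs, signs[1:]))

def count_real_roots(u, a, b):
    """Number of roots of u(x) with a < x <= b."""
    seq = sturm_sequence(u)
    return variation(seq, a) - variation(seq, b)

u = [1, 0, -3, 1]                      # x^3 - 3x + 1, which has three real roots
print(count_real_roots(u, -2, 2), count_real_roots(u, 0, 1))
```

For this cubic the sequence is $x^3 - 3x + 1$, $3x^2 - 3$, $2x - 1$, $9/4$; the sign patterns at -2 and 2 are -, +, -, + and +, +, +, +, so all three roots lie between -2 and 2.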

Although Algorithms C and E are interesting, they aren’t the whole story. 
Important alternative ways to calculate polynomial gcd’s over the integers are 
discussed at the end of Section 4.6.2. There is also a general determinant- 
evaluation algorithm that may be said to include Algorithm C as a special case; 
see E. H. Bareiss, Math. Comp. 22 (1968), 565-578.


EXERCISES 

1. [10] Compute the pseudo-quotient q(x) and pseudo-remainder r{x), namely, the 
polynomials satisfying (8), when u(x ) = x 6 -f- x 5 — z 4 2x 3 -(- 3z 2 — x + 2 and 
v(x) = 2x 3 + 2x 2 — x -(- 3, over the integers. 

2. [15] What is the greatest common divisor of 3x 6 -f z 5 + 4z 4 + 4z 3 -f- 3z 2 -j- 4z + 2 
and its “reverse” 2z 6 + 4x 5 + 3x 4 4- 4z 3 + 4z 2 z + 3, modulo 7? 
► 3. [M25] Show that Euclid’s algorithm for polynomials over a field S can be extended
to find polynomials U(x) and V(x) over S such that

    u(x)V(x) + U(x)v(x) = gcd(u(x), v(x)).

(Cf. Algorithm 4.5.2X.) What are the degrees of the polynomials U(x) and V(x) that
are computed by this extended algorithm? Prove that if S is the field of rational
numbers, and if u(x) = x^m - 1 and v(x) = x^n - 1, then the extended algorithm
yields polynomials U(x) and V(x) having integer coefficients. Find U(x) and V(x) when
u(x) = x^21 - 1 and v(x) = x^13 - 1.

► 4. [M30] Let p be prime, and suppose that Euclid’s algorithm applied to the
polynomials u(x) and v(x) modulo p yields a sequence of polynomials having respective
degrees m, n, n_1, ..., n_t, -∞, where m = deg(u), n = deg(v), and n_t ≥ 0. Assume
that m ≥ n. If u(x) and v(x) are monic polynomials, independently and uniformly
distributed over all the p^{m+n} pairs of monic polynomials having respective degrees
m and n, what are the average values of the three quantities t, n_1 + ... + n_t, and
(n - n_1)n_1 + ... + (n_{t-1} - n_t)n_t, as functions of m, n, and p? (These three
quantities are the fundamental factors in the running time of Euclid’s algorithm applied to
polynomials modulo p, assuming that division is done by Algorithm D.) [Hint: Show
that u(x) mod v(x) is uniformly distributed and independent of v(x).]

5. [M22] What is the probability that u(x) and v(x) are relatively prime modulo p,
if u(x) and v(x) are independently and uniformly distributed monic polynomials of
degree n?

6. [M23] We have seen that Euclid’s Algorithm 4.5.2A for integers can be directly 
adapted to an algorithm for the greatest common divisor of polynomials. Can the 
“binary gcd algorithm,” Algorithm 4.5.2B, be adapted in an analogous way to an 
algorithm that applies to polynomials? 

7. [M10] What are the units in the domain of all polynomials over a unique
factorization domain S?

► 8. [M22] Show that if a polynomial with integer coefficients is irreducible over the
domain of integers, it is irreducible when considered as a polynomial over the field of
rational numbers.

9. [M25] Let u(x) and v(x) be primitive polynomials over a unique factorization
domain S. Prove that u(x) and v(x) are relatively prime if and only if there are
polynomials U(x) and V(x) over S such that u(x)V(x) + U(x)v(x) is a polynomial of
degree zero. [Hint: Extend Algorithm E, as Algorithm 4.5.2E is extended in exercise 3.]

10. [M28] Prove that the polynomials over a unique factorization domain form a
unique factorization domain. [Hint: Use the result of exercise 9 to help show that there
is at most one kind of factorization possible.]

11. [M22] What row names would have appeared in Table 1 if the sequence of degrees
had been 9, 6, 5, 2, -∞ instead of 8, 6, 4, 2, 1, 0?

► 12. [M24] Let u_1(x), u_2(x), u_3(x), ... be a sequence of polynomials obtained during a
run of Algorithm C. “Sylvester’s matrix” is the square matrix formed from rows A_{n_2-1}
through A_0 and B_{n_1-1} through B_0 (in a notation analogous to that of Table 1). Show
that if u_1(x) and u_2(x) have a common factor of positive degree, then the determinant
of Sylvester’s matrix is zero; conversely, given that deg(u_k) = 0 for some k, show that
the determinant of Sylvester’s matrix is nonzero by deriving a formula for its absolute
value in terms of ℓ(u_j) and deg(u_j), 1 ≤ j ≤ k.
13. [M22] Show that the leading coefficient ℓ of the primitive part of gcd(u(x), v(x))
enters into Algorithm C’s polynomial sequence as shown in (28), when δ_1 = δ_2 = ... =
δ_{k-1} = 1. What is the behavior for general δ_j?

14. [M29] Let r(x) be the pseudo-remainder when u(x) is pseudo-divided by v(x). If
deg(u) ≥ deg(v) + 2 and deg(v) ≥ deg(r) + 2, show that r(x) is a multiple of ℓ(v).

15. [M26] Prove Hadamard’s inequality (25). [Hint: Consider the matrix AA^T.]

16. [HM22] Let f(x_1, ..., x_n) be a multivariate polynomial with real coefficients not
all zero, and let a_N be the number of solutions to the equation f(x_1, ..., x_n) = 0 such
that |x_1| ≤ N, ..., |x_n| ≤ N, and such that each x_j is an integer. Prove that the
roots have zero density, i.e., that lim_{N→∞} a_N/(2N + 1)^n = 0.

17. [M32] (P. M. Cohn’s algorithm for division of string polynomials.) Let A be an
“alphabet,” i.e., a set of symbols. A string α on A is a sequence of n ≥ 0 symbols,
α = a_1 ... a_n, where each a_j is in A. The length of α, denoted by |α|, is the number n
of symbols. A string polynomial on A is a finite sum U = \sum_k r_k α_k, where each r_k is
a nonzero rational number and each α_k is a string on A. The degree of U, deg(U), is
defined to be -∞ if U = 0 (i.e., if the sum is empty), otherwise deg(U) = max |α_k|.
The sum and product of string polynomials are defined in an obvious manner, e.g.,
(\sum_j r_j α_j)(\sum_k s_k β_k) = \sum_{j,k} r_j s_k α_j β_k, where the product of two strings is
obtained by simply juxtaposing them. For example, if A = {a, b}, U = ab + ba - 2a - 2b, and
V = a + b - 1, then deg(U) = 2, deg(V) = 1, V^2 = aa + ab + ba + bb - 2a - 2b + 1,
and V^2 - U = aa + bb + 1. Clearly, deg(UV) = deg(U) + deg(V), and deg(U + V) ≤
max(deg(U), deg(V)), with equality in the latter formula if deg(U) ≠ deg(V). (String
polynomials may be regarded as ordinary multivariate polynomials over the field of
rational numbers, except that the variables are not commutative under multiplication.
In the conventional language of pure mathematics, the set of string polynomials with
the operations defined here is the “free associative algebra” generated by A over the
rationals.)

a) Let Q_1, Q_2, U, V be string polynomials with deg(U) ≥ deg(V) and such that
deg(Q_1 U - Q_2 V) < deg(Q_1 U). Give an algorithm to find a string polynomial Q
such that deg(U - QV) < deg(U). (Thus if we are given U and V such that
Q_1 U = Q_2 V + R and deg(R) < deg(Q_1 U), for some Q_1 and Q_2, then there is a
solution to these conditions with Q_1 = 1.)

b) Given that U and V are string polynomials with deg(V) > deg(Q_1 U - Q_2 V) for
some Q_1 and Q_2, show that the result of (a) can be improved to find a quotient Q
such that U = QV + R, deg(R) < deg(V). (This is the analog of (1) for string
polynomials; part (a) showed that we can make deg(R) < deg(U), under weaker
hypotheses.)

c) A “homogeneous” polynomial is one whose terms all have the same degree (length).
If U_1, U_2, V_1, V_2 are homogeneous string polynomials with U_1 V_1 = U_2 V_2 and
deg(V_1) ≥ deg(V_2), show that there is a homogeneous string polynomial U such
that U_2 = U_1 U and V_1 = U V_2.

d) Given that U and V are homogeneous string polynomials with UV = VU, prove
that there is a homogeneous string polynomial W such that U = rW^m, V = sW^n
for some integers m, n and rational numbers r, s. Give an algorithm to compute
such a W having the largest possible degree. (This algorithm is of interest, for
example, when U = α and V = β are strings satisfying αβ = βα; then W is
simply a string γ. When U = x^m and V = x^n, the solution of largest degree is
the string W = x^{gcd(m,n)}, so this algorithm includes a gcd algorithm for integers
as a special case.)

► 18. [M24] (Euclidean algorithm for string polynomials.) Let V_1 and V_2 be string
polynomials, not both zero, having a “common left multiple.” (This means that there
exist string polynomials U_1 and U_2, not both zero, such that U_1 V_1 = U_2 V_2.) The
purpose of this exercise is to find an algorithm to compute their “greatest common
right divisor” gcrd(V_1, V_2) as well as their “least common left multiple” lclm(V_1, V_2).
The latter quantities are defined as follows: gcrd(V_1, V_2) is a common right divisor
of V_1 and V_2 (that is, V_1 = W_1 gcrd(V_1, V_2) and V_2 = W_2 gcrd(V_1, V_2) for some W_1
and W_2), and any common right divisor of V_1 and V_2 is a right divisor of gcrd(V_1, V_2);
lclm(V_1, V_2) = Z_1 V_1 = Z_2 V_2 for some Z_1 and Z_2, and any common left multiple of V_1
and V_2 is a left multiple of lclm(V_1, V_2).

For example, let U_1 = abbbab + abbab - bbab + ab - 1, V_1 = babab + abab + ab - b;
U_2 = abb + ab - b, V_2 = babbabab + bababab + babab + abab - babb - 1. Then we have
U_1 V_1 = U_2 V_2 = abbbabbabab + abbabbabab + abbbababab + abbababab - bbabbabab +
abbbabab - bbababab + 2abbabab - abbbabb + ababab - abbabb - bbabab - babab +
bbabb - abb - ab + b. For these string polynomials it can be shown that gcrd(V_1, V_2) =
ab + 1, and lclm(V_1, V_2) = U_1 V_1.

The division algorithm of exercise 17 may be restated thus: If V_1 and V_2 are string
polynomials, with V_2 ≠ 0, and if U_1 ≠ 0 and U_2 satisfy the equation U_1 V_1 = U_2 V_2,
then there exist string polynomials Q and R such that

    V_1 = Q V_2 + R,    where deg(R) < deg(V_2).

It follows readily that Q and R are uniquely determined; they do not depend on the
given U_1 and U_2. Furthermore the result is right-left symmetric, in the sense that

    U_2 = U_1 Q + R',   where deg(R') = deg(U_1) - deg(V_2) + deg(R) < deg(U_1).

Show that this division algorithm can be extended to an algorithm that computes
lclm(V_1, V_2) and gcrd(V_1, V_2); in fact, the extended algorithm finds string polynomials
Z_1 and Z_2 such that Z_1 V_1 + Z_2 V_2 = gcrd(V_1, V_2). [Hint: Use auxiliary variables u_1,
u_2, v_1, v_2, w_1, w_2, w_1', w_2', z_1, z_2, z_1', z_2', whose values are string polynomials; start by
setting u_1 ← U_1, u_2 ← U_2, v_1 ← V_1, v_2 ← V_2, and throughout the algorithm maintain
the conditions

    U_1 w_1 + U_2 w_2 = u_1,               z_1 V_1 + z_2 V_2 = v_1,
    U_1 w_1' + U_2 w_2' = u_2,             z_1' V_1 + z_2' V_2 = v_2,
    u_1 z_1 - u_2 z_1' = (-1)^n U_1,       w_1 v_1 - w_1' v_2 = (-1)^n V_1,
    -u_1 z_2 + u_2 z_2' = (-1)^n U_2,      -w_2 v_1 + w_2' v_2 = (-1)^n V_2

at the nth iteration. This might be regarded as the “ultimate” extension of Euclid’s
algorithm.]

19. [M39] (Common divisors of square matrices.) Exercise 18 shows that the
concept of greatest common right divisor can be meaningful when multiplication is not
commutative. Prove that any two n × n matrices A and B of integers have a greatest
common right matrix divisor D. [Suggestion: Design an algorithm whose inputs are
A and B, and whose outputs are integer matrices D, P, Q, X, Y, where A = PD,
B = QD, and D = XA + YB.] Find a greatest common right divisor of the matrices
(3 4) and (J?). 
20. [M40] Investigate the accuracy of Euclid’s algorithm: What can be said about
calculation of the greatest common divisor of polynomials whose coefficients are floating
point numbers?

21. [M25] Prove that the computation time required by Algorithm C to compute the
gcd of two nth degree polynomials over the integers is O(n^4 (log Nn)^2), if the coefficients
of the given polynomials are bounded by N in absolute value.

22. [M23] Prove Sturm’s theorem. [Hint: Some sign sequences are impossible.]

23. [M22] Prove that if u(x) in (29) has deg(u) real roots, then we have deg(u_{j+1}) =
deg(u_j) - 1 for 0 ≤ j ≤ k.

24. [M21] Show that (19) simplifies to (20) and (23) simplifies to (24).

25. [M24] (W. S. Brown.) Prove that all the polynomials u_j(x) in (16) for j ≥ 3 are
multiples of gcd(ℓ(u), ℓ(v)), and explain how to improve Algorithm C accordingly.

► 26. [M26] The purpose of this exercise is to give an analog for polynomials of the
fact that continued fractions with positive integer entries give the best approximations
to real numbers (exercise 4.5.3-42).

Let u(x) and v(x) be polynomials over a field, with deg(u) > deg(v), and let
a_1(x), a_2(x), ... be the quotient polynomials when Euclid’s algorithm is applied to u(x)
and v(x). For example, the sequence of quotients in (5) and (6) is 9x^2 + 7, 5x^2 + 5,
6x^3 + 5x^2 + 6x + 5, 9x + 12. We wish to show that the convergents p_n(x)/q_n(x) of
the continued fraction //a_1(x), a_2(x), ...// are the “best approximations” of low degree
to the rational function v(x)/u(x), where we have p_n(x) = Q_{n-1}(a_2(x), ..., a_n(x)) and
q_n(x) = Q_n(a_1(x), ..., a_n(x)) in terms of the continuant polynomials of Eq. 4.5.3-4.
By convention, we let p_0(x) = q_{-1}(x) = 0, p_{-1}(x) = q_0(x) = 1.

Prove that if p(x) and q(x) are polynomials such that deg(q) < deg(q_n) and
deg(pu - qv) ≤ deg(p_{n-1}u - q_{n-1}v), for some n ≥ 1, then p(x) = c p_{n-1}(x) and
q(x) = c q_{n-1}(x) for some constant c. In particular, each q_n(x) is a “record-breaking”
polynomial in the sense that no nonzero polynomial q(x) of smaller degree can make
the quantity p(x)u(x) - q(x)v(x), for any polynomial p(x), achieve a degree as small as
p_n(x)u(x) - q_n(x)v(x).


*4.6.2. Factorization of Polynomials 

Let us now consider the problem of factoring polynomials, not merely finding 
the greatest common divisor of two or more of them. 

Factoring modulo p. As in the case of integer numbers (Sections 4.5.2, 4.5.4),
the problem of factoring seems to be more difficult than finding the greatest
common divisor. But factorization of polynomials modulo a prime integer p is
not as hard to do as we might expect. It is much easier to find the factors of an
arbitrary polynomial of degree n, modulo 2, than to use any known method to
find the factors of an arbitrary n-bit binary number. This surprising situation
is a consequence of an instructive factorization algorithm discovered in 1967 by
Elwyn R. Berlekamp [Bell System Technical J. 46 (1967), 1853-1859].

Let p be a prime number; all arithmetic on polynomials in the following
discussion will be done modulo p. Suppose that someone has given us a polynomial
u(x), whose coefficients are chosen from the set {0, 1, ..., p - 1}; we may assume
that u(x) is monic. Our goal is to express u(x) in the form

    u(x) = p_1(x)^{e_1} ... p_r(x)^{e_r},                              (1)

where p_1(x), ..., p_r(x) are distinct, monic, irreducible polynomials.

As a first step, we can use a standard technique to determine whether any
of the exponents e_1, ..., e_r are greater than unity. If

    u(x) = u_n x^n + ... + u_0 = v(x)^2 w(x),                          (2)

then its “derivative” formed in the usual way (but modulo p) is

    u'(x) = n u_n x^{n-1} + ... + u_1 = 2v(x)v'(x)w(x) + v(x)^2 w'(x),   (3)

and this is a multiple of the squared factor v(x). Therefore our first step in
factoring u(x) is to form

    gcd(u(x), u'(x)) = d(x).                                           (4)

If d(x) is equal to 1, we know that u(x) is “squarefree,” the product of distinct
primes p_1(x) ... p_r(x). If d(x) is not equal to 1 and d(x) ≠ u(x), then d(x) is a
proper factor of u(x); the relation between the factors of d(x) and the factors of
u(x)/d(x) speeds up the factorization process nicely in this case (see exercise 34).
Finally, if d(x) = u(x), we must have u'(x) = 0; hence the coefficient u_k of x^k
is nonzero only when k is a multiple of p. This means that u(x) can be written
as a polynomial of the form v(x^p), and in such a case we have

    u(x) = v(x^p) = (v(x))^p;                                          (5)

the factorization process can be completed by finding the irreducible factors
of v(x) and raising them to the pth power.
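
The three cases for d(x) can be observed on small inputs. The following Python sketch is an illustration under stated assumptions, not a routine from the text: polynomials are lists of coefficients with the constant term first, p is prime, reciprocals are taken by Fermat’s theorem, and the function names are invented for this example.

```python
def trim(a):
    """Remove high-order zero coefficients in place (lists are low-order first)."""
    while a and a[-1] == 0:
        a.pop()
    return a

def poly_mod(a, b, p):
    """Remainder of a(x) divided by b(x) modulo the prime p; b must be nonzero."""
    a = trim([c % p for c in a])
    inv = pow(b[-1], p - 2, p)           # reciprocal of the leading coefficient
    while len(a) >= len(b):
        c = a[-1] * inv % p
        shift = len(a) - len(b)
        for i, bc in enumerate(b):
            a[shift + i] = (a[shift + i] - c * bc) % p
        trim(a)
    return a

def poly_gcd(a, b, p):
    """Monic gcd of a(x) and b(x) modulo p, by Euclid's algorithm."""
    a, b = trim([c % p for c in a]), trim([c % p for c in b])
    while b:
        a, b = b, poly_mod(a, b, p)
    inv = pow(a[-1], p - 2, p)
    return [c * inv % p for c in a]

def derivative(u, p):
    """u'(x) modulo p; the result is the empty list when u'(x) = 0."""
    return trim([k * c % p for k, c in enumerate(u)][1:])

def squarefree_gcd(u, p):
    """d(x) = gcd(u(x), u'(x)) modulo p, as in (4)."""
    return poly_gcd(u, derivative(u, p), p)

# (x + 1)^2 (x + 2) = x^3 + 4x^2 + 2 modulo 5: d(x) = x + 1 is a proper factor.
print(squarefree_gcd([2, 0, 4, 1], 5))
# x^5 + 2 = (x + 2)^5 modulo 5: the derivative vanishes, so d(x) = u(x).
print(squarefree_gcd([2, 0, 0, 0, 0, 1], 5))
```

A result of `[1]` would mean that the input is squarefree; the two calls shown exercise the other two cases.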

Identity (5) may appear somewhat strange to the reader; it is an important
fact that is basic to Berlekamp’s algorithm and to several other methods we
shall discuss. We can prove it as follows: If v_1(x) and v_2(x) are any polynomials
modulo p, then

    (v_1(x) + v_2(x))^p = v_1(x)^p + \binom{p}{1} v_1(x)^{p-1} v_2(x)
                            + ... + \binom{p}{p-1} v_1(x) v_2(x)^{p-1} + v_2(x)^p
                        = v_1(x)^p + v_2(x)^p,

since the binomial coefficients \binom{p}{1}, ..., \binom{p}{p-1} are all multiples of p.
Furthermore if a is any integer, we have a^p ≡ a (modulo p) by Fermat’s theorem. Therefore
when v(x) = v_m x^m + v_{m-1} x^{m-1} + ... + v_0, we find that

    v(x)^p = (v_m x^m)^p + (v_{m-1} x^{m-1})^p + ... + (v_0)^p
           = v_m x^{mp} + v_{m-1} x^{(m-1)p} + ... + v_0 = v(x^p).
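
Identity (5) is easy to confirm numerically for particular cases. The following Python sketch is only a spot check, not a proof: it raises a polynomial to the pth power by repeated multiplication modulo p and compares the result with v(x^p). Coefficient lists have the constant term first, the function names are invented here, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def poly_mul(a, b, p):
    """Product of two nonzero polynomials modulo p (low-order-first lists)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ac in enumerate(a):
        for j, bc in enumerate(b):
            out[i + j] = (out[i + j] + ac * bc) % p
    return out

def poly_pow(a, e, p):
    """a(x)^e modulo p, by e successive multiplications."""
    result = [1]
    for _ in range(e):
        result = poly_mul(result, a, p)
    return result

def substitute_power(v, p):
    """The polynomial v(x^p): coefficient v_k moves from x^k to x^(kp)."""
    out = [0] * ((len(v) - 1) * p + 1)
    for k, c in enumerate(v):
        out[k * p] = c % p
    return out

p = 5
v = [3, 0, 2, 1]                               # v(x) = x^3 + 2x^2 + 3
print(poly_pow(v, p, p) == substitute_power(v, p))
print(all(comb(p, k) % p == 0 for k in range(1, p)))
```

Both lines print `True`; the second confirms, for p = 5, that the inner binomial coefficients are multiples of p.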
The above remarks show that the problem of factoring a polynomial reduces
to the problem of factoring a squarefree polynomial. Let us therefore assume
that

    u(x) = p_1(x) p_2(x) ... p_r(x)                                    (6)

is the product of distinct primes. How can we be clever enough to discover the
p_j(x)’s when only u(x) is given? Berlekamp’s idea is to make use of the Chinese
remainder theorem, which is valid for polynomials just as it is valid for integers
(see exercise 3). If (s_1, s_2, ..., s_r) is any r-tuple of integers mod p, the Chinese
remainder theorem implies that there is a unique polynomial v(x) such that

    v(x) ≡ s_1 (modulo p_1(x)), ..., v(x) ≡ s_r (modulo p_r(x)),
                                                                       (7)
    deg(v) < deg(p_1) + deg(p_2) + ... + deg(p_r) = deg(u).

The notation g(x) ≡ h(x) (modulo f(x)) that appears here is the same as
“g(x) ≡ h(x) (modulo f(x) and p)” in exercise 3.2.2-11, since we are considering
polynomial arithmetic modulo p. The polynomial v(x) in (7) gives us a way to get
at the factors of u(x), for if r ≥ 2 and s_1 ≠ s_2, we will have gcd(u(x), v(x) - s_1)
divisible by p_1(x) but not by p_2(x).

Since this observation shows that we can get information about the factors
of u(x) from appropriate solutions v(x) of (7), let us analyze (7) more closely.
In the first place we can observe that the polynomial v(x) satisfies the condition
v(x)^p ≡ s_j^p = s_j ≡ v(x) (modulo p_j(x)) for 1 ≤ j ≤ r; therefore

    v(x)^p ≡ v(x) (modulo u(x)),    deg(v) < deg(u).                   (8)

In the second place we have the basic polynomial identity

    x^p - x ≡ (x - 0)(x - 1) ... (x - (p - 1)) (modulo p)              (9)

(see exercise 6); hence

    v(x)^p - v(x) = (v(x) - 0)(v(x) - 1) ... (v(x) - (p - 1))          (10)

is an identity for any polynomial v(x), when we are working modulo p. If v(x)
satisfies (8), it follows that u(x) divides the left-hand side of (10), so every
irreducible factor of u(x) must divide one of the p relatively prime factors of
the right-hand side of (10). In other words, all solutions of (8) must have the
form of (7), for some s_1, s_2, ..., s_r; there are exactly p^r solutions of (8).

The solutions v(x) to congruence (8) therefore provide a key to the
factorization of u(x). It may seem harder to find all solutions to (8) than to factor u(x)
in the first place, but in fact this is not true, since the set of solutions to (8) is
closed under addition. Let deg(u) = n; we can construct the n × n matrix

        ( q_{0,0}     q_{0,1}     ...   q_{0,n-1}   )
    Q = (   ...         ...               ...       )                  (11)
        ( q_{n-1,0}   q_{n-1,1}   ...   q_{n-1,n-1} )

where

    x^{pk} ≡ q_{k,n-1} x^{n-1} + ... + q_{k,1} x + q_{k,0} (modulo u(x)).   (12)

Then v(x) = v_{n-1} x^{n-1} + ... + v_1 x + v_0 is a solution to (8) if and only if

    (v_0, v_1, ..., v_{n-1}) Q = (v_0, v_1, ..., v_{n-1});             (13)

for the latter equation holds if and only if

    v(x) = \sum_j v_j x^j = \sum_j \sum_k v_k q_{k,j} x^j ≡ \sum_k v_k x^{pk} = v(x^p) ≡ v(x)^p (modulo u(x)).

Berlekamp’s factoring algorithm therefore proceeds as follows:

B1. Ensure that u(x) is squarefree; i.e., if gcd(u(x), u'(x)) ≠ 1, reduce the
problem of factoring u(x), as stated earlier in this section.

B2. Form the matrix Q defined by (11) and (12). This can be done in one of two
ways, depending on whether or not p is very large, as explained below.

B3. “Triangularize” the matrix Q - I, where I = (δ_{ij}) is the n × n identity
matrix, finding its rank n - r and finding linearly independent vectors
v^{[1]}, ..., v^{[r]} such that v^{[j]}(Q - I) = (0, 0, ..., 0) for 1 ≤ j ≤ r. (The first vector
v^{[1]} may always be taken as (1, 0, ..., 0), representing the trivial solution
v^{[1]}(x) = 1 to (8). The “triangularization” needed in this step can be done
using appropriate column operations, as explained in Algorithm N below.)
At this point, r is the number of irreducible factors of u(x), because the
solutions to (8) are the p^r polynomials corresponding to the vectors
t_1 v^{[1]} + ... + t_r v^{[r]} for all choices of integers 0 ≤ t_1, ..., t_r < p. Therefore if r = 1
we know that u(x) is irreducible, and the procedure terminates.

B4. Calculate gcd(u(x), v^{[2]}(x) - s) for 0 ≤ s < p, where v^{[2]}(x) is the polynomial
represented by vector v^{[2]}. The result will be a nontrivial factorization of
u(x), because v^{[2]}(x) - s is nonzero and has degree less than deg(u), and by
exercise 7 we have

    u(x) = \prod_{0 ≤ s < p} gcd(v(x) - s, u(x))                       (14)

whenever v(x) satisfies (8).

If the use of v^{[2]}(x) does not succeed in splitting u(x) into r factors,
further factors can be obtained by calculating gcd(v^{[k]}(x) - s, w(x)) for
0 ≤ s < p and all factors w(x) found so far, for k = 3, 4, ..., until r factors
are obtained. (If we choose s_i ≠ s_j in (7), we obtain a solution v(x) to (8)
that distinguishes p_i(x) from p_j(x); some v^{[k]}(x) - s will be divisible by p_i(x)
and not by p_j(x), so this procedure will eventually find all of the factors.)

If p is 2 or 3, the calculations of this step are quite efficient; but if p is
more than 25, say, there is a much better way to proceed, as we shall see
later. |
As an example of this procedure, let us now determine the factorization of

    u(x) = x^8 + x^6 + 10x^4 + 10x^3 + 8x^2 + 2x + 8                   (15)

modulo 13. (This polynomial appears in several of the examples in Section 4.6.1.)
A quick calculation using Algorithm 4.6.1E shows that gcd(u(x), u'(x)) = 1;
therefore u(x) is squarefree, and we turn to step B2. Step B2 involves calculating
the Q matrix, which in this case is an 8 × 8 array. The first row of Q is always
(1, 0, 0, ..., 0), representing the polynomial x^0 mod u(x) = 1. The second row
represents x^13 mod u(x), and, in general, x^k mod u(x) may readily be determined
as follows (for relatively small values of k): If

    u(x) = x^n + u_{n-1} x^{n-1} + ... + u_1 x + u_0

and if

    x^k ≡ a_{k,n-1} x^{n-1} + ... + a_{k,1} x + a_{k,0} (modulo u(x)),

then

    x^{k+1} ≡ a_{k,n-1} x^n + ... + a_{k,1} x^2 + a_{k,0} x
            ≡ a_{k,n-1}(-u_{n-1} x^{n-1} - ... - u_1 x - u_0) + a_{k,n-2} x^{n-1} + ... + a_{k,0} x
            = a_{k+1,n-1} x^{n-1} + ... + a_{k+1,1} x + a_{k+1,0},

where

    a_{k+1,j} = a_{k,j-1} - a_{k,n-1} u_j.                             (16)

In this formula a_{k,-1} is treated as zero, so that a_{k+1,0} = -a_{k,n-1} u_0. The
simple “shift register” recurrence (16) makes it easy to calculate x^1, x^2, x^3, ...
mod u(x). Inside a computer, this calculation is of course generally done by
maintaining a one-dimensional array (a_{n-1}, ..., a_1, a_0) and repeatedly setting
t ← a_{n-1}, a_{n-1} ← (a_{n-2} - t u_{n-1}) mod p, ..., a_1 ← (a_0 - t u_1) mod p, and
a_0 ← (-t u_0) mod p. (We have seen similar procedures in connection with random
number generation; cf. Eq. 3.2.2-10.) For the example polynomial u(x) in (15),
we obtain the following sequence of coefficients of x^k mod u(x), using arithmetic
modulo 13:

     k   a_{k,7}  a_{k,6}  a_{k,5}  a_{k,4}  a_{k,3}  a_{k,2}  a_{k,1}  a_{k,0}

     0      0        0        0        0        0        0        0        1
     1      0        0        0        0        0        0        1        0
     2      0        0        0        0        0        1        0        0
     3      0        0        0        0        1        0        0        0
     4      0        0        0        1        0        0        0        0
     5      0        0        1        0        0        0        0        0
     6      0        1        0        0        0        0        0        0
     7      1        0        0        0        0        0        0        0
     8      0       12        0        3        3        5       11        5
     9     12        0        3        3        5       11        5        0
    10      0        4        3        2        8        0        2        8
    11      4        3        2        8        0        2        8        0
    12      3       11        8       12        1        2        5        7
    13     11        5       12       10       11        7        1        2
Therefore the second row of Q is (2, 1, 7, 11, 10, 12, 5, 11). Similarly we may
determine x^26 mod u(x), ..., x^91 mod u(x), and we find that

            ( 1   0   0   0   0   0   0   0 )
            ( 2   1   7  11  10  12   5  11 )
            ( 3   6   4   3   0   4   7   2 )
        Q = ( 4   3   6   5   1   6   2   3 ),
            ( 2  11   8   8   3   1   3  11 )
            ( 6  11   8   6   2   7  10   9 )
            ( 5  11   7  10   0  11   7  12 )
            ( 3   3  12   5   0  11   9  12 )
                                                                       (17)
            ( 0   0   0   0   0   0   0   0 )
            ( 2   0   7  11  10  12   5  11 )
            ( 3   6   3   3   0   4   7   2 )
    Q - I = ( 4   3   6   4   1   6   2   3 ).
            ( 2  11   8   8   2   1   3  11 )
            ( 6  11   8   6   2   6  10   9 )
            ( 5  11   7  10   0  11   6  12 )
            ( 3   3  12   5   0  11   9  11 )
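
The shift register recurrence (16) translates almost directly into code. The following Python sketch is an illustration with invented names, not the text’s own program: it steps through x^0, x^1, x^2, ... mod u(x) and keeps every pth coefficient vector as a row of Q, stored low-order first as (q_{k,0}, ..., q_{k,n-1}). It assumes that u(x) is monic with deg(u) ≥ 1 and that p is small enough for p(n - 1) steps to be affordable.

```python
def berlekamp_q_matrix(u, p):
    """Rows x^(pk) mod u(x), k = 0..n-1, for monic u given low-order first."""
    n = len(u) - 1
    a = [1] + [0] * (n - 1)              # coefficients of x^0
    rows = []
    for k in range(p * (n - 1) + 1):
        if k % p == 0:
            rows.append(a[:])            # row k/p of Q
        t = a[n - 1]
        # Recurrence (16): a_{k+1,j} = a_{k,j-1} - t*u_j, with a_{k,-1} = 0.
        a = [(-t * u[0]) % p] + [(a[j - 1] - t * u[j]) % p for j in range(1, n)]
    return rows

def vec_times_matrix(v, m, p):
    """Row vector times matrix, modulo p."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) % p
            for j in range(len(m[0]))]

u = [8, 2, 8, 10, 10, 0, 1, 0, 1]        # the polynomial (15), constant term first
Q = berlekamp_q_matrix(u, 13)
print(Q[1])                              # coefficients of x^13 mod u(x)
```

The helper `vec_times_matrix` makes it possible to test condition (13) for a candidate vector v by comparing `vec_times_matrix(v, Q, 13)` with v itself.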


That finishes step B2; the next step of Berlekamp’s procedure requires
finding the “null space” of Q - I. In general, suppose that A is an n × n
matrix over a field, whose rank n - r is to be determined; suppose further that
we wish to determine linearly independent vectors v^{[1]}, v^{[2]}, ..., v^{[r]} such that
v^{[1]}A = v^{[2]}A = ... = v^{[r]}A = (0, ..., 0). An algorithm for this calculation
can be based on the observation that any column of A may be multiplied by
a nonzero quantity, and any multiple of one of its columns may be added to a
different column, without changing the rank or the vectors v^{[1]}, ..., v^{[r]}. (These
transformations amount to replacing A by AB, where B is a nonsingular matrix.)
The following well-known “triangularization” procedure may therefore be used.

Algorithm N (Null space algorithm). Let A be an n × n matrix, whose elements
a_{ij} belong to a field and have subscripts in the range 0 ≤ i, j < n. This
algorithm outputs r vectors v^{[1]}, ..., v^{[r]}, which are linearly independent over
the field and satisfy v^{[j]}A = (0, ..., 0), where n - r is the rank of A.

N1. [Initialize.] Set c_0 ← c_1 ← ... ← c_{n-1} ← -1, r ← 0. (During the calculation
we will have c_j ≥ 0 only if a_{c_j j} = -1 and all other entries of row c_j are
zero.)

N2. [Loop on k.] Do step N3 for k = 0, 1, ..., n - 1, and then terminate the
algorithm.

N3. [Scan row for dependence.] If there is some j in the range 0 ≤ j < n such
that a_{kj} ≠ 0 and c_j < 0, then do the following: Multiply column j of A by
-1/a_{kj} (so that a_{kj} becomes equal to -1); then add a_{ki} times column j to
column i for all i ≠ j; finally set c_j ← k. (Since it is not difficult to show
that a_{sj} = 0 for all s < k, these operations have no effect on rows 0, 1, ...,
k - 1 of A.)

On the other hand, if there is no j in the range 0 ≤ j < n such that
a_{kj} ≠ 0 and c_j < 0, then set r ← r + 1 and output the vector

    v^{[r]} = (v_0, v_1, ..., v_{n-1})

defined by the rule

          { a_{ks},  if c_s = j ≥ 0;
    v_j = { 1,       if j = k;                                         (18)
          { 0,       otherwise.      |

An example will reveal the mechanism of this algorithm. Let A be the matrix
Q - I of (17) over the field of integers modulo 13. When k = 0, we output the
vector v^{[1]} = (1, 0, 0, 0, 0, 0, 0, 0). When k = 1, we may take j in step N3 to be
either 0, 2, 3, 4, 5, 6, or 7; the choice here is completely arbitrary, although it
affects the particular vectors that are chosen to be output by the algorithm. For
hand calculation, it is most convenient to pick j = 5, since a_{15} = 12 = -1
already; the column operations of step N3 then change A to the matrix

    (   0    0    0    0    0    0    0    0 )
    (   0    0    0    0    0  (12)   0    0 )
    (  11    6    5    8    1    4    1    7 )
    (   3    3    9    5    9    6    6    4 )
    (   4   11    2    6   12    1    8    9 ).
    (   5   11   11    7   10    6    1   10 )
    (   1   11    6    1    6   11    9    3 )
    (  12    3   11    9    6   11   12    2 )

(The circled element in column “5”, row “1”, shown here in parentheses, is used to
indicate that c_5 = 1. Remember that Algorithm N numbers the rows and columns of the
matrix starting with 0, not 1.) When k = 2, we may choose j = 4 and proceed
in a similar way, obtaining the following matrices, which all have the same null
space as Q - I:

    k = 2
    (   0    0    0    0    0    0    0    0 )
    (   0    0    0    0    0  (12)   0    0 )
    (   0    0    0    0  (12)   0    0    0 )
    (   8    1    3   11    4    9   10    6 )
    (   2    4    7    1    1    5    9    3 )
    (  12    3    0    5    3    5    4    5 )
    (   0    1    2    5    7    0    3    0 )
    (  11    6    7    0    7    0    6   12 )

    k = 3
    (   0    0    0    0    0    0    0    0 )
    (   0    0    0    0    0  (12)   0    0 )
    (   0    0    0    0  (12)   0    0    0 )
    (   0  (12)   0    0    0    0    0    0 )
    (   9    9    8    9   11    8    8    5 )
    (   1   10    4   11    4    4    0    0 )
    (   5   12   12    7    3    4    6    7 )
    (   2    7    2   12    9   11   11    2 )

    k = 4
    (   0    0    0    0    0    0    0    0 )
    (   0    0    0    0    0  (12)   0    0 )
    (   0    0    0    0  (12)   0    0    0 )
    (   0  (12)   0    0    0    0    0    0 )
    (   0    0    0    0    0    0    0  (12) )
    (   1   10    4   11    4    4    0    0 )
    (   8    2    6   10   11   11    0    9 )
    (   1    6    4   11    2    0    0   10 )

    k = 5
    (   0    0    0    0    0    0    0    0 )
    (   0    0    0    0    0  (12)   0    0 )
    (   0    0    0    0  (12)   0    0    0 )
    (   0  (12)   0    0    0    0    0    0 )
    (   0    0    0    0    0    0    0  (12) )
    ( (12)   0    0    0    0    0    0    0 )
    (   5    0    0    0    5    5    0    9 )
    (  12    9    0    0   11    9    0   10 )

Now every column that has no circled entry is completely zero; so when k = 6
and k = 7 the algorithm outputs two more vectors, namely

    v^{[2]} = (0, 5, 5, 0, 9, 5, 1, 0),    v^{[3]} = (0, 9, 11, 9, 10, 12, 0, 1).

From the form of matrix A after k = 5, it is evident that these vectors satisfy
the equation vA = (0, ..., 0). Since the computation has produced three linearly
independent vectors, u(x) must have exactly three irreducible factors.
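
Algorithm N is short enough to transcribe directly. The following Python sketch is one possible rendering over the integers modulo a prime p, with names invented here. It always takes the smallest eligible j in step N3, so the vectors it outputs for Q - I need not coincide with v^{[2]} and v^{[3]} above, although they span the same null space.

```python
def null_space(A, p):
    """Algorithm N modulo the prime p: row vectors v with v*A = (0, ..., 0)."""
    n = len(A)
    A = [[x % p for x in row] for row in A]
    c = [-1] * n                                         # step N1
    vectors = []
    for k in range(n):                                   # step N2
        j = next((j for j in range(n) if A[k][j] != 0 and c[j] < 0), None)
        if j is not None:                                # step N3, first case
            scale = (-pow(A[k][j], p - 2, p)) % p        # -1/a_kj
            for s in range(n):
                A[s][j] = A[s][j] * scale % p
            for i in range(n):
                f = A[k][i]
                if i != j and f:
                    for s in range(n):                   # add a_ki * (column j)
                        A[s][i] = (A[s][i] + f * A[s][j]) % p
            c[j] = k
        else:                                            # step N3, second case
            v = [0] * n
            v[k] = 1
            for s in range(n):
                if c[s] >= 0:
                    v[c[s]] = A[k][s]                    # rule (18)
            vectors.append(v)
    return vectors

Q_MINUS_I = [                                            # the matrix Q - I of (17)
    [0, 0, 0, 0, 0, 0, 0, 0],
    [2, 0, 7, 11, 10, 12, 5, 11],
    [3, 6, 3, 3, 0, 4, 7, 2],
    [4, 3, 6, 4, 1, 6, 2, 3],
    [2, 11, 8, 8, 2, 1, 3, 11],
    [6, 11, 8, 6, 2, 6, 10, 9],
    [5, 11, 7, 10, 0, 11, 6, 12],
    [3, 3, 12, 5, 0, 11, 9, 11],
]
basis = null_space(Q_MINUS_I, 13)
print(len(basis))                                        # number of vectors output
for v in basis:
    print(v)
```

The input matrix is copied before the column operations, so the caller’s matrix is left unchanged.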

Finally we can go to step B4 of the factoring procedure. The calculation of
gcd(u(x), v^{[2]}(x) - s) for 0 ≤ s < 13, where v^{[2]}(x) = x^6 + 5x^5 + 9x^4 + 5x^2 + 5x,
gives x^5 + 5x^4 + 9x^3 + 5x + 5 as the answer when s = 0, and x^3 + 8x^2 + 4x + 12
when s = 2; the gcd is unity for other values of s. Therefore v^{[2]}(x) gives us only
two of the three factors. Turning to gcd(v^{[3]}(x) - s, x^5 + 5x^4 + 9x^3 + 5x + 5),
where v^{[3]}(x) = x^7 + 12x^5 + 10x^4 + 9x^3 + 11x^2 + 9x, we obtain the value
x^4 + 2x^3 + 3x^2 + 4x + 6 when s = 6, x + 3 when s = 8, and unity otherwise.
Thus the complete factorization is

    u(x) = (x^4 + 2x^3 + 3x^2 + 4x + 6)(x^3 + 8x^2 + 4x + 12)(x + 3).   (19)
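
The gcd calculations of step B4 for this example can be replayed mechanically. The following Python sketch is an illustration with invented helper names; it stores polynomials with the constant term first, and its `split` function simply tries every s in 0 ≤ s < p against every factor found so far, which is reasonable only when p is small.

```python
def trim(a):
    while a and a[-1] == 0:
        a.pop()
    return a

def poly_mod(a, b, p):
    """Remainder of a(x) divided by nonzero b(x) modulo the prime p."""
    a = trim([c % p for c in a])
    inv = pow(b[-1], p - 2, p)
    while len(a) >= len(b):
        c = a[-1] * inv % p
        shift = len(a) - len(b)
        for i, bc in enumerate(b):
            a[shift + i] = (a[shift + i] - c * bc) % p
        trim(a)
    return a

def poly_gcd(a, b, p):
    """Monic gcd modulo p, by Euclid's algorithm."""
    a, b = trim([c % p for c in a]), trim([c % p for c in b])
    while b:
        a, b = b, poly_mod(a, b, p)
    inv = pow(a[-1], p - 2, p)
    return [c * inv % p for c in a]

def split(factors, v, p):
    """Replace each factor w(x) by the nontrivial gcd(w(x), v(x) - s), 0 <= s < p."""
    result = []
    for w in factors:
        for s in range(p):
            shifted = v[:]
            shifted[0] = (shifted[0] - s) % p
            g = poly_gcd(w, shifted, p)
            if len(g) > 1:               # keep only gcds of positive degree
                result.append(g)
    return result

u = [8, 2, 8, 10, 10, 0, 1, 0, 1]        # the polynomial (15)
v2 = [0, 5, 5, 0, 9, 5, 1, 0]            # v^{[2]}(x)
v3 = [0, 9, 11, 9, 10, 12, 0, 1]         # v^{[3]}(x)
first = split([u], v2, 13)
final = split(first, v3, 13)
print(first)
print(final)
```

By (14), the pieces produced from each w(x) multiply back to w(x), so no factor is lost when the list is refined.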

Let us now estimate the running time of Berlekamp’s method when an nth
degree polynomial is factored modulo p. First assume that p is relatively small,
so that the four arithmetic operations can be done modulo p in essentially a fixed
length of time. (Division modulo p can be converted to multiplication, by storing
a table of reciprocals as suggested in exercise 9; for example, when working
modulo 13, we have 1/2 = 7, 1/3 = 9, etc.) The computation in step B1 takes O(n^2)
units of time; step B2 takes O(pn^2). For step B3 we use Algorithm N, which
requires O(n^3) units of time at most. Finally, in step B4 we can observe that the
calculation of gcd(f(x), g(x)) by Euclid’s algorithm takes O(deg(f) deg(g)) units
of time; hence the calculation of gcd(v^{[j]}(x) - s, w(x)) for fixed j and s and for
all factors w(x) of u(x) found so far takes O(n^2) units. Step B4 therefore requires
O(prn^2) units of time at most. Berlekamp’s procedure factors an arbitrary
polynomial of degree n, modulo p, in O(n^3 + prn^2) steps, when p is a small prime;
and exercise 5 shows that the average number of factors, r, is approximately ln n.
Thus the algorithm is much faster than any known methods of factoring n-digit
numbers in the p-ary number system.

Of course, when n and p are small, a trial-and-error factorization procedure 
analogous to Algorithm 4.5.4A will be even faster than Berlekamp’s method. 
Exercise 1 implies that it is a good idea to cast out factors of small degree first 
when p is small, before going to any more complicated procedure, even when n 
is large. 

When p is large, a different implementation of Berlekamp’s procedure would
be used for the calculations. Division modulo p would not be done with an
auxiliary table of reciprocals; instead the method of exercise 4.5.2-15, which
takes O((log p)^2) steps, would probably be used. Then step B1 would take
O(n^2 (log p)^2) units of time; similarly, step B3 takes O(n^3 (log p)^2). In step B2,
we can form x^p mod u(x) in a more efficient way than (16) when p is large:
Section 4.6.3 shows that this value can essentially be obtained by using O(log p)
operations of “squaring mod u(x),” i.e., going from x^k mod u(x) to x^{2k} mod u(x).
The squaring operation is relatively easy to perform if we first make an auxiliary
table of x^m mod u(x) for m = n, n + 1, ..., 2n - 2; if

    x^k mod u(x) = c_{n-1} x^{n-1} + ... + c_1 x + c_0,

then

    x^{2k} mod u(x) = (c_{n-1}^2 x^{2n-2} + ... + (c_1 c_0 + c_1 c_0) x + c_0^2) mod u(x),

where x^{2n-2}, ..., x^n can be replaced by polynomials in the auxiliary table. The
total time to compute x^p mod u(x) comes to O(n^2 (log p)^3) units, and we obtain
the second row of Q. To get further rows of Q, we can compute x^{2p} mod u(x),
x^{3p} mod u(x), ..., simply by multiplying repeatedly by x^p mod u(x), in a fashion
analogous to squaring mod u(x); step B2 is completed in O(n^2 (log p)^3) units of
time. Thus steps B1, B2, and B3 take a total of O(n^2 (log p)^3 + n^3 (log p)^2) units
of time; these three steps tell us the number of factors of u(x).
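
The repeated-squaring route to x^p mod u(x) can be sketched as follows. This Python fragment is an illustration with invented names, and it departs from the description above in one respect, which is an assumption of the sketch: instead of consulting an auxiliary table of x^m mod u(x), it reduces each product directly by subtracting multiples of the monic polynomial u(x). Coefficient lists have the constant term first and length n = deg(u), with n ≥ 2.

```python
def mulmod(a, b, u, p):
    """(a(x) * b(x)) mod u(x) modulo p, for monic u of degree n and len(a) = len(b) = n."""
    n = len(u) - 1
    prod = [0] * (2 * n - 1)
    for i, ac in enumerate(a):
        for j, bc in enumerate(b):
            prod[i + j] = (prod[i + j] + ac * bc) % p
    for k in range(2 * n - 2, n - 1, -1):     # eliminate x^k for k >= n
        t = prod[k]
        if t:
            for j in range(n + 1):
                prod[k - n + j] = (prod[k - n + j] - t * u[j]) % p
    return prod[:n]

def x_power_mod(e, u, p):
    """x^e mod u(x), scanning the bits of e from the most significant end."""
    n = len(u) - 1
    x = [0, 1] + [0] * (n - 2)
    result = [1] + [0] * (n - 1)
    for bit in bin(e)[2:]:
        result = mulmod(result, result, u, p)     # x^k -> x^(2k)
        if bit == "1":
            result = mulmod(result, x, u, p)      # x^(2k) -> x^(2k+1)
    return result

def q_matrix_by_powers(u, p):
    """Rows x^0, x^p, x^(2p), ... mod u(x), each obtained from the previous one."""
    n = len(u) - 1
    xp = x_power_mod(p, u, p)
    rows = [[1] + [0] * (n - 1)]
    for _ in range(n - 1):
        rows.append(mulmod(rows[-1], xp, u, p))
    return rows

u = [8, 2, 8, 10, 10, 0, 1, 0, 1]                 # the polynomial (15)
print(x_power_mod(13, u, 13))                     # second row of Q
```

For the example polynomial this reproduces the second row (2, 1, 7, 11, 10, 12, 5, 11) obtained earlier from the shift register recurrence.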

But when p is large and we get to step B4, we are asked to calculate a greatest
common divisor for p different values of s, and that is out of the question if p
is even moderately large. This hurdle was first surmounted by Hans Zassenhaus
[J. Number Theory 1 (1969), 291-311], who showed how to determine all of the
“useful” values of s (see exercise 14); but an even better way to proceed was
found by Zassenhaus and Cantor in 1980. If v(x) is any solution to (8), we know
that u(x) divides $v(x)^p - v(x) = v(x)\cdot(v(x)^{(p-1)/2} + 1)\cdot(v(x)^{(p-1)/2} - 1)$. This
suggests that we calculate

$$\gcd\bigl(u(x),\, v(x)^{(p-1)/2} - 1\bigr); \tag{20}$$

with a little bit of luck, (20) will be a nontrivial factor of u(x). In fact, we can
determine exactly how much luck is involved, by considering (7). Let $v(x) \equiv s_j$
(modulo $p_j(x)$) for 1 ≤ j ≤ r; then $p_j(x)$ divides $v(x)^{(p-1)/2} - 1$ if and only if
$s_j^{(p-1)/2} \equiv 1$ (modulo p). We know that exactly (p - 1)/2 of the integers s in
the range 0 ≤ s < p satisfy $s^{(p-1)/2} \equiv 1$ (modulo p), hence about half of the
$p_j(x)$ will appear in the gcd (20). More precisely, if v(x) is a random solution
of (8), where all $p^r$ solutions are equally likely, the probability that the gcd (20)
equals u(x) is exactly

$$\bigl((p-1)/2p\bigr)^r,$$

and the probability that it equals 1 is $((p+1)/2p)^r$. The probability that a
nontrivial factor will be obtained is therefore

$$1 - \bigl((p-1)/2p\bigr)^r - \bigl((p+1)/2p\bigr)^r \;\ge\; \tfrac{4}{9},$$

for all r ≥ 2 and p ≥ 3.
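
The smallest case, r = 2 and p = 3, can be worked out by hand from the two probabilities above, namely $((p-1)/2p)^r$ for a gcd equal to u(x) and $((p+1)/2p)^r$ for a gcd equal to 1. The following short derivation is offered only as a sanity check on those formulas.

```latex
\begin{align*}
\Pr[\gcd = u(x)] &= \Bigl(\tfrac{3-1}{2\cdot 3}\Bigr)^{2} = \tfrac{1}{9}, \\
\Pr[\gcd = 1]    &= \Bigl(\tfrac{3+1}{2\cdot 3}\Bigr)^{2} = \tfrac{4}{9}, \\
\Pr[\text{nontrivial factor}] &= 1 - \tfrac{1}{9} - \tfrac{4}{9} = \tfrac{4}{9}.
\end{align*}
```

For fixed r both subtracted terms tend to $2^{-r}$ as p grows, so the chance of success approaches $1 - 2^{1-r}$, which is at least 1/2 when r ≥ 2.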

It is therefore a good idea to replace step B4 by the following procedure,
unless p is quite small: Set $v(x) \leftarrow a_1v^{[1]}(x) + a_2v^{[2]}(x) + \cdots + a_rv^{[r]}(x)$, where
the coefficients $a_j$ are randomly chosen in the range 0 ≤ $a_j$ < p. Let the current
partial factorization of u(x) be $u_1(x)\ldots u_t(x)$ where t is initially 1. Compute

$$g_i(x) = \gcd\bigl(u_i(x),\, v(x)^{(p-1)/2} - 1\bigr)$$

for all i such that $\deg(u_i) > 1$; replace $u_i(x)$ by $g_i(x)\cdot(u_i(x)/g_i(x))$ and increase
the value of t, whenever a nontrivial gcd is found. Repeat this process for
different choices of v(x) until t = r.

If we assume (as we may) that only $O(\log r)$ random solutions v(x) to (8)
will be needed, we can give an upper bound on the time required to perform
this alternative to step B4. It takes $O(rn(\log p)^2)$ steps to compute v(x); and if
$\deg(u_i) = d$, it takes $O(d^2(\log p)^3)$ steps to compute $v(x)^{(p-1)/2} \bmod u_i(x)$ and
$O(d^2(\log p)^2)$ further steps to compute $\gcd(u_i(x), v(x)^{(p-1)/2} - 1)$. Thus the
total time is $O(n^2(\log p)^3 \log r)$.

For further discussion, see the articles by E. R. Berlekamp, Math. Comp. 24 
(1970), 713-735, and Robert T. Moenck, Math. Comp. 31 (1977), 235-250. 

Distinct-degree factorization. We shall now turn to a somewhat simpler way to 
find factors modulo p. The ideas we have studied so far in this section involve 
many instructive insights into computational algebra, so the author does not 
apologize to the reader for presenting them; but it turns out that the problem 
of factorization modulo p can actually be solved without relying on so many 
concepts. 

In the first place we can make use of the fact that an irreducible polynomial
q(x) of degree d is a divisor of $x^{p^d} - x$, and it is not a divisor of $x^{p^c} - x$ for
c < d; see exercise 16. We can therefore cast out the irreducible factors of each
degree separately, by adopting the following strategy.

D1. Rule out squared factors, as in Berlekamp’s method. Also set v(x) ← u(x),
w(x) ← “x”, and d ← 0. (Here v(x) and w(x) are variables that have
polynomials as values.)

D2. (At this point $w(x) = x^{p^d} \bmod v(x)$; all of the irreducible factors of v(x) are
distinct and have degree > d.) If d + 1 > ½deg(v), the procedure terminates
since we either have v(x) = 1 or v(x) is irreducible. Otherwise increase d
by 1 and replace w(x) by $w(x)^p \bmod v(x)$.

D3. Find $g_d(x) = \gcd(w(x) - x, v(x))$. (This is the product of all the irreducible
factors of u(x) whose degree is d.) If $g_d(x) \ne 1$, replace v(x) by $v(x)/g_d(x)$
and w(x) by w(x) mod v(x); and if the degree of $g_d(x)$ is greater than d, use
the algorithm below to find its factors. Return to step D2. ∎

This procedure determines the product of all irreducible factors of each 
degree d, and therefore it tells us how many factors there are of each degree. 
Since the three factors of our example polynomial (19) have different degrees, 
they would all be discovered without any need to factorize the polynomials $g_d(x)$.
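
Steps D2 and D3 can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: polynomials are coefficient lists with the lowest degree first, the input is assumed to be monic and already squarefree (step D1), and all helper names are hypothetical. Running it on the example polynomial (22) with p = 13 separates the factors of degrees 1, 3, and 4 shown in (23).

```python
def trim(a):
    while a and a[-1] == 0:
        a.pop()
    return a

def psub(a, b, p):
    n = max(len(a), len(b))
    a, b = a + [0] * (n - len(a)), b + [0] * (n - len(b))
    return trim([(x - y) % p for x, y in zip(a, b)])

def pmul(a, b, p):
    r = [0] * (len(a) + len(b) - 1) if a and b else []
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            r[i + j] = (r[i + j] + x * y) % p
    return trim(r)

def pdivmod(a, b, p):
    """Quotient and remainder of a(x) by b(x), modulo the prime p."""
    a, b = trim([c % p for c in a]), trim([c % p for c in b])
    q = [0] * max(len(a) - len(b) + 1, 0)
    inv = pow(b[-1], -1, p)
    while len(a) >= len(b):
        c = a[-1] * inv % p
        k = len(a) - len(b)
        q[k] = c
        for i, y in enumerate(b):
            a[k + i] = (a[k + i] - c * y) % p
        trim(a)
    return trim(q), a

def pgcd(a, b, p):
    """Monic gcd by Euclid's algorithm modulo p (a and b not both zero)."""
    a, b = trim([c % p for c in a]), trim([c % p for c in b])
    while b:
        a, b = b, pdivmod(a, b, p)[1]
    inv = pow(a[-1], -1, p)
    return [c * inv % p for c in a]

def ppow(t, e, m, p):
    """t(x)^e mod m(x), modulo p, by binary exponentiation."""
    result, t = [1], pdivmod(t, m, p)[1]
    while e:
        if e & 1:
            result = pdivmod(pmul(result, t, p), m, p)[1]
        t = pdivmod(pmul(t, t, p), m, p)[1]
        e >>= 1
    return result

def distinct_degree(u, p):
    """Map each degree d to g_d(x), the product of the degree-d factors of u."""
    v = trim([c % p for c in u])
    w, d, parts = [0, 1], 0, {}
    while 2 * (d + 1) <= len(v) - 1:            # D2: stop once d+1 > deg(v)/2
        d += 1
        w = ppow(w, p, v, p)                    # w <- w^p mod v
        g = pgcd(psub(w, [0, 1], p), v, p)      # D3: gcd(w - x, v)
        if len(g) > 1:
            parts[d] = g
            v = pdivmod(v, g, p)[0]
            w = pdivmod(w, v, p)[1]
    if len(v) > 1:
        parts[len(v) - 1] = v                   # what remains is irreducible
    return parts

u = [-5, 2, 8, -3, -3, 0, 1, 0, 1]              # polynomial (22)
parts = distinct_degree(u, 13)
```

Each value in `parts` is the product of all irreducible factors of one degree, so a further splitting step is needed only when the degree of an entry exceeds its key.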

The distinct degree factorization technique was known to several people
in 1960 [cf. S. W. Golomb, L. R. Welch, A. Hales, “On the factorization of
trinomials over GF(2),” Jet Propulsion Laboratory memo 20-189 (July 14, 1959)],
but there seem to be no references to it in the “open literature.” Previous work
by Š. Schwarz, Quart. J. Math., Oxford (2) 7 (1956), 110-124, had shown how
to determine the number of irreducible factors of each degree, but not their
product, using the matrix Q.

To complete the method, we need a way to split the polynomial $g_d(x)$ into
its irreducible factors when $\deg(g_d) > d$. Michael Rabin pointed out in 1976
that this can be done by doing arithmetic in the field of $p^d$ elements. David G.
Cantor and Hans Zassenhaus discovered in 1979 that there is an even simpler
way to proceed, based on the following identity: If p is any odd prime, we have

$$g_d(x) = \gcd\bigl(g_d(x), t(x)\bigr)\cdot\gcd\bigl(g_d(x), t(x)^{(p^d-1)/2} + 1\bigr)\cdot\gcd\bigl(g_d(x), t(x)^{(p^d-1)/2} - 1\bigr) \tag{21}$$

for all polynomials t(x), since $t(x)^{p^d} - t(x)$ is a multiple of all irreducible
polynomials of degree d. (We may regard t(x) as an element of the field of size $p^d$, when
that field consists of all polynomials modulo an irreducible f(x) as in exercise 16.)
Now exercise 29 shows that $\gcd(g_d(x), t(x)^{(p^d-1)/2} - 1)$ will be a nontrivial factor of
$g_d(x)$ about 50 per cent of the time, when t(x) is a random polynomial of degree
≤ 2d - 1; hence it will not take long to discover all of the factors. We may
assume without loss of generality that t(x) is monic, since integer multiples of
t(x) make no difference except possibly to change $t(x)^{(p^d-1)/2}$ into its negative.
Thus in the case d = 1, we can take t(x) = x + s, where s is chosen at random.
[See SIAM J. Computing 9 (1980), 273-280; Math. Comp., to appear.]
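
The splitting step based on identity (21) can be sketched as follows. The code is a hedged illustration: it assumes p is an odd prime and that g(x) is monic and is a product of distinct irreducibles of degree d, it draws t(x) uniformly from all polynomials of degree at most 2d - 1 (without forcing t to be monic), and every function name is hypothetical. A seeded `random.Random` is used only so that a run is repeatable.

```python
import random

def trim(a):
    while a and a[-1] == 0:
        a.pop()
    return a

def psub(a, b, p):
    n = max(len(a), len(b))
    a, b = a + [0] * (n - len(a)), b + [0] * (n - len(b))
    return trim([(x - y) % p for x, y in zip(a, b)])

def pmul(a, b, p):
    r = [0] * (len(a) + len(b) - 1) if a and b else []
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            r[i + j] = (r[i + j] + x * y) % p
    return trim(r)

def pdivmod(a, b, p):
    a, b = trim([c % p for c in a]), trim([c % p for c in b])
    q = [0] * max(len(a) - len(b) + 1, 0)
    inv = pow(b[-1], -1, p)
    while len(a) >= len(b):
        c = a[-1] * inv % p
        k = len(a) - len(b)
        q[k] = c
        for i, y in enumerate(b):
            a[k + i] = (a[k + i] - c * y) % p
        trim(a)
    return trim(q), a

def pgcd(a, b, p):
    a, b = trim([c % p for c in a]), trim([c % p for c in b])
    while b:
        a, b = b, pdivmod(a, b, p)[1]
    inv = pow(a[-1], -1, p)
    return [c * inv % p for c in a]

def ppow(t, e, m, p):
    result, t = [1], pdivmod(t, m, p)[1]
    while e:
        if e & 1:
            result = pdivmod(pmul(result, t, p), m, p)[1]
        t = pdivmod(pmul(t, t, p), m, p)[1]
        e >>= 1
    return result

def split_equal_degree(g, d, p, rng):
    """Return the irreducible factors of g, all assumed to have degree d."""
    if len(g) - 1 <= d:
        return [g]
    while True:
        t = [rng.randrange(p) for _ in range(2 * d)]   # random t, degree <= 2d-1
        h = ppow(t, (p ** d - 1) // 2, g, p)
        f = pgcd(psub(h, [1], p), g, p)                # third gcd in (21)
        if 1 < len(f) < len(g):                        # nontrivial split found
            rest = pdivmod(g, f, p)[0]
            return (split_equal_degree(f, d, p, rng)
                    + split_equal_degree(rest, d, p, rng))

rng = random.Random(1)
# (x+1)(x+2)(x+3) modulo 13, and (x^3+2x+1)(x^3+2x+2) modulo 3.
linear = sorted(split_equal_degree([6, 11, 6, 1], 1, 13, rng))
cubics = sorted(split_equal_degree([2, 0, 1, 0, 1, 0, 1], 3, 3, rng))
```

The loop simply retries with a fresh t(x) whenever the gcd is trivial, which by the 50 per cent estimate above should happen only a few times on average.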

Sometimes this procedure will in fact succeed for d > 1 when only linear
polynomials t(x) are used. For example, there are eight irreducible polynomials
f(x) of degree 3, modulo 3, and they will all be distinguished by calculating
$\gcd(f(x), (x + s)^{13} - 1)$ for 0 ≤ s < 3.
Exercise 31 contains a partial explanation of why linear polynomials can be
effective; however, when the number of irreducible polynomials of degree d exceeds
$2^p$, it is clear that there will exist irreducibles that cannot be distinguished
by linear choices of t(x).

An alternative to (21) that works when p = 2 is discussed in exercise 30.

Factoring over the integers. It is somewhat more difficult to find the complete 
factorization of polynomials with integer coefficients when we are not working 
modulo p, but some reasonably efficient methods are available for this purpose. 

Isaac Newton gave a method for finding linear and quadratic factors of 
polynomials with integer coefficients in his Arithmetica Universalis (1707). This
method was extended by an astronomer named Friedrich von Schubert in 1793, 
who showed how to find all factors of degree n in a finite number of steps; see 
M. Cantor, Geschichte der Mathematik 4 (Leipzig: Teubner, 1908), 136-137. 
L. Kronecker rediscovered von Schubert's method independently about 90 years 
later; but unfortunately the method is very inefficient when n is five or more. 
Much better results can be obtained with the help of the “mod p” factorization 
methods presented above. 

Suppose that we want to find the irreducible factors of a given polynomial

$$u(x) = u_nx^n + u_{n-1}x^{n-1} + \cdots + u_0, \qquad u_n \ne 0,$$

over the integers. As a first step, we can divide by the greatest common divisor of
the coefficients; this leaves us with a primitive polynomial. We may also assume
that u(x) is squarefree, by dividing out gcd(u(x), u′(x)) as in exercise 34.

Now if u(x) = v(x)w(x), where each of these polynomials has integer coefficients,
we obviously have u(x) ≡ v(x)w(x) (modulo p) for all primes p, so there
is a nontrivial factorization modulo p unless p divides $\ell(u)$. An efficient algorithm
for factoring u(x) modulo p can therefore be used in an attempt to reconstruct
possible factorizations of u(x) over the integers.

For example, let

$$u(x) = x^8 + x^6 - 3x^4 - 3x^3 + 8x^2 + 2x - 5. \tag{22}$$

We have seen above in (19) that

$$u(x) \equiv (x^4 + 2x^3 + 3x^2 + 4x + 6)(x^3 + 8x^2 + 4x + 12)(x + 3) \pmod{13}; \tag{23}$$

and the complete factorization of u(x) modulo 2 shows one factor of degree 6
and another of degree 2 (see exercise 10). From (23) we can see that u(x) has no
factor of degree 2, so it must be irreducible over the integers.

This particular example was perhaps too simple; experience shows that most 
irreducible polynomials can be recognized as such by examining their factors 
modulo a few primes, but it is not always so easy to establish irreducibility. For 
example, there are polynomials that can be properly factored modulo p for all 
primes p, with consistent degrees of the factors, yet they are irreducible over the 
integers (see exercise 12). 

Almost all polynomials are irreducible over the integers, as shown in exer¬ 
cise 27. But we usually aren’t trying to factor a random polynomial; there is 
probably some reason to expect a nontrivial factor or else the calculation would 
not have been attempted in the first place. We need a method that identifies 
factors when they are there. 

In general if we try to find the factors of u(x) by considering its behavior 
modulo different primes, the results will not be easy to combine; for example, 
if u(x) actually is the product of four quadratic polynomials, it will be hard to 
match up their images with respect to different prime moduli. Therefore it is 
desirable to stick to a single prime and to see how much mileage we can get out 
of it, once we feel that the factors modulo this prime have the right degrees. 

One idea is to work modulo a very large prime p, big enough so that the
coefficients in any true factorization u(x) = v(x)w(x) over the integers must
actually lie between -p/2 and p/2. Then all possible integer factors can be
“read off” from the mod p factors we know how to compute.

Exercise 20 shows how to obtain fairly good bounds on the coefficients of
polynomial factors. For example, if (22) were reducible it would have a factor
v(x) of degree ≤ 4, and the coefficients of v would be at most 34 in magnitude
by the results of that exercise. So all potential factors of u(x) will be fairly
evident if we work modulo any prime p > 68. Indeed, the complete factorization
modulo 71 is

$$(x + 12)(x + 25)(x^2 - 13x - 7)(x^4 - 24x^3 - 16x^2 + 31x - 12),$$

and we see immediately that none of these polynomials are factors of (22) over
the integers since their constant terms do not divide 5; furthermore there is no
way to obtain a divisor of (22) by grouping two of these factors, since none of
the conceivable constant terms 12 × 25, -12 × 7, 12 × (-12) is congruent to
±1 or ±5 (modulo 71).
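
The arithmetic in this example is easy to replay mechanically. The sketch below, with hypothetical helper names, multiplies the four factors modulo 71, reduces the product to symmetric representatives, and then lists the constant terms of every single factor and every pair of factors so that they can be compared with the divisors ±1, ±5 of the constant term of (22).

```python
from itertools import combinations

def mul(a, b):
    """Product of two integer coefficient lists (lowest degree first)."""
    r = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            r[i + j] += x * y
    return r

def symmetric(c, m):
    """Representative of c modulo m lying in the interval [-m/2, m/2)."""
    c %= m
    return c - m if 2 * c >= m else c

u = [-5, 2, 8, -3, -3, 0, 1, 0, 1]                      # polynomial (22)
factors71 = [[12, 1], [25, 1], [-7, -13, 1], [-12, 31, -16, -24, 1]]

# The product of the four factors should reproduce u(x) modulo 71.
product = [1]
for f in factors71:
    product = mul(product, f)
product_matches = [symmetric(c, 71) for c in product] == u

# Constant terms of every single factor and every pair, reduced modulo 71.
allowed = {1, -1, 5, -5}                                # divisors of u(0) = -5
constants = {}
for size in (1, 2):
    for combo in combinations(range(4), size):
        c = 1
        for i in combo:
            c *= factors71[i][0]
        constants[combo] = symmetric(c, 71)
survivors = [combo for combo, c in constants.items() if c in allowed]
```

An empty `survivors` list means that no single factor and no pair of factors passes the constant-term test, in agreement with the argument in the text.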

Incidentally, it is not trivial to obtain good bounds on the coefficients of
polynomial factors, since a lot of cancellation can occur when polynomials are
multiplied. For example, the innocuous-looking polynomial $x^n - 1$ has irreducible
factors whose coefficients exceed $\exp(n^{1/\lg\lg n})$ for infinitely many n. [See R. C.
Vaughan, Michigan Math. J. 21 (1974), 289-295.] The factorization of $x^n - 1$
is discussed in exercise 32.
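
The figure 34 quoted above for (22) can be reproduced numerically. The sketch below assumes that the bound of exercise 20(d) reads $|v_j| \le \binom{m-1}{j}|u| + \binom{m-1}{j-1}|u_n|$, where $|u|$ is the square root of the sum of the squared coefficients; the function name is hypothetical, and floating-point arithmetic is used, which is adequate for small examples only.

```python
from math import comb, floor, sqrt

def factor_coefficient_bound(u, m):
    """Largest value of the assumed exercise-20(d) bound over 0 <= j <= m."""
    norm = sqrt(sum(c * c for c in u))          # |u| in the notation of exercise 20
    lead = abs(u[-1])                           # |u_n|
    best = 0.0
    for j in range(m + 1):
        lower = comb(m - 1, j - 1) if j >= 1 else 0
        best = max(best, comb(m - 1, j) * norm + lower * lead)
    return best

u = [-5, 2, 8, -3, -3, 0, 1, 0, 1]              # polynomial (22), lowest degree first
bound = floor(factor_coefficient_bound(u, 4))   # factors of degree 4
```

Since integer coefficients cannot exceed the floor of the bound, any prime larger than twice this value leaves room for the symmetric representatives to be read off directly.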

Instead of using a large prime p, which might have to be truly enormous
if u(x) has large degree or large coefficients, we can also make use of small p,
provided that u(x) is squarefree mod p. For in this case, an important construction
introduced by K. Hensel [Theorie der Algebraischen Zahlen (Leipzig:
Teubner, 1908), Chapter 4] can be used to extend a factorization modulo p in a
unique way to a factorization modulo $p^e$ for arbitrarily high e. Hensel’s method
is described in exercise 22; if we apply it to (23) with p = 13 and e = 2, we
obtain the unique factorization

$$u(x) \equiv (x - 36)(x^3 - 18x^2 + 82x - 66)(x^4 + 54x^3 - 10x^2 + 69x + 84)$$

(modulo 169). Calling these factors $v_1(x)v_3(x)v_4(x)$, we see that $v_1(x)$ and $v_3(x)$
are not factors of u(x) over the integers, nor is their product $v_1(x)v_3(x)$ when the
coefficients have been reduced modulo 169 to the range $(-\frac{169}{2}, \frac{169}{2})$. Thus we
have exhausted all possibilities, proving once again that u(x) is irreducible over
the integers, this time using only its factorization modulo 13.
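
One lifting step from modulus 13 to modulus 169 can be sketched as follows. The construction shown is the usual linear Hensel step, written here as an assumption about what exercise 22 asks for rather than as its official solution: with $c = (u - vw)/p^e$ reduced mod p and $a v + b w \equiv 1$, set $bc = qv + \delta_v$ and $\delta_w = ac + qw$, then add $p^e\delta_v$ and $p^e\delta_w$ to v and w. All function names are hypothetical.

```python
def trim(a):
    while a and a[-1] == 0:
        a.pop()
    return a

def axpy(a, b, c, m):
    """Return a(x) + c*b(x) with coefficients reduced modulo m."""
    n = max(len(a), len(b))
    a, b = a + [0] * (n - len(a)), b + [0] * (n - len(b))
    return trim([(x + c * y) % m for x, y in zip(a, b)])

def mul(a, b, m):
    r = [0] * (len(a) + len(b) - 1) if a and b else []
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            r[i + j] = (r[i + j] + x * y) % m
    return trim(r)

def divmod_p(a, b, p):
    a, b = trim([c % p for c in a]), trim([c % p for c in b])
    q = [0] * max(len(a) - len(b) + 1, 0)
    inv = pow(b[-1], -1, p)
    while len(a) >= len(b):
        c = a[-1] * inv % p
        k = len(a) - len(b)
        q[k] = c
        for i, y in enumerate(b):
            a[k + i] = (a[k + i] - c * y) % p
        trim(a)
    return trim(q), a

def ext_gcd(v, w, p):
    """Return (a, b) with a*v + b*w = 1 (mod p); v and w must be coprime mod p."""
    r0, r1 = trim([c % p for c in v]), trim([c % p for c in w])
    s0, s1, t0, t1 = [1], [], [], [1]
    while r1:
        q, r = divmod_p(r0, r1, p)
        r0, r1 = r1, r
        s0, s1 = s1, axpy(s0, mul(q, s1, p), -1, p)
        t0, t1 = t1, axpy(t0, mul(q, t1, p), -1, p)
    inv = pow(r0[0], -1, p)                 # r0 is a nonzero constant here
    return axpy([], s0, inv, p), axpy([], t0, inv, p)

def hensel_step(u, v, w, a, b, p, pe):
    """Lift u = v*w (mod pe) to modulus pe*p; v is monic, a*v + b*w = 1 (mod p)."""
    diff = axpy(u, mul(v, w, pe * p), -1, pe * p)    # u - v*w, a multiple of pe
    c = trim([(x // pe) % p for x in diff])
    q, dv = divmod_p(mul(b, c, p), v, p)             # b*c = q*v + dv
    dw = axpy(mul(a, c, p), mul(q, w, p), 1, p)      # a*c + q*w
    return axpy(v, dv, pe, pe * p), axpy(w, dw, pe, pe * p)

u = [-5, 2, 8, -3, -3, 0, 1, 0, 1]                   # polynomial (22)
v1 = [3, 1]                                          # x + 3, from (23)
w1 = mul([12, 4, 8, 1], [6, 4, 3, 2, 1], 13)         # the other two factors of (23)
a, b = ext_gcd(v1, w1, 13)
v2, w2 = hensel_step(u, v1, w1, a, b, 13, 13)        # factors modulo 169
```

Because the lifted factors are unique, `v2` should agree with the linear factor x - 36 quoted above once its constant term is reduced to the symmetric range.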

The example we have been considering is atypical in one important respect:
We have been factoring the monic polynomial u(x) in (22), so we could assume
that all its factors were monic. What should we do if $u_n > 1$? In such a case, the
leading coefficients of all but one of the polynomial factors can be varied almost
arbitrarily modulo $p^e$; we certainly don’t want to try all possibilities. Perhaps the
reader has already noticed this problem. Fortunately there is a simple way out:
the factorization u(x) = v(x)w(x) implies a factorization $u_nu(x) = v_1(x)w_1(x)$
where $\ell(v_1) = \ell(w_1) = u_n = \ell(u)$. (“Do you mind if I multiply your polynomial
by its leading coefficient before factoring it?”) We can proceed essentially as
above, but using $p^e > 2B$ where B now bounds the maximum coefficient for
factors of $u_nu(x)$ instead of u(x).

Putting these observations all together results in the following procedure:

F1. Find the unique squarefree factorization

$$u(x) \equiv \ell(u)\,v_1(x)\ldots v_r(x) \pmod{p^e},$$

where $p^e$ is sufficiently large as explained above, and where the $v_j(x)$ are
monic. (This will be possible for all but a few primes p, see exercise 23.)
Also set d ← 1.

F2. For every combination of factors $v(x) = v_{i_1}(x)\ldots v_{i_d}(x)$, with $i_1 = 1$ if
d = ½r, form the unique polynomial $\bar v(x) \equiv \ell(u)v(x)$ (modulo $p^e$) whose
coefficients all lie in the interval $[-\frac12p^e, \frac12p^e)$. If $\bar v(x)$ divides $\ell(u)u(x)$, output
the factor $\mathrm{pp}(\bar v(x))$, divide u(x) by this factor, and remove the corresponding
$v_i(x)$ from the list of factors modulo $p^e$; decrease r by the number of factors
removed, and terminate the algorithm if d > ½r.

F3. Increase d by 1, and return to F2 if d ≤ ½r. ∎

At the conclusion of this process, the current value of u(x) will be the final
irreducible factor of the originally given polynomial. Note that if $|u_0| < |u_n|$, it
is preferable to do all of the work with the reverse polynomial $u_0x^n + \cdots + u_n$,
whose factors are the reverses of the factors of u(x).
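
The trial-division part of step F2 can be sketched as follows. This is a simplified illustration, not the full procedure: it assumes the monic factors modulo $p^e$ are already known, it returns only the first true divisor found instead of updating u(x) and r, and it omits the $i_1 = 1$ refinement; all names are hypothetical. The second example uses the non-monic polynomial $2x^2 + 3x + 1$ to show the effect of multiplying by the leading coefficient.

```python
from functools import reduce
from itertools import combinations
from math import gcd

def mul_mod(a, b, m):
    """Product of two coefficient lists (lowest degree first), modulo m."""
    r = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            r[i + j] = (r[i + j] + x * y) % m
    return r

def symmetric(c, m):
    """Representative of c modulo m lying in the interval [-m/2, m/2)."""
    c %= m
    return c - m if 2 * c >= m else c

def int_divides(v, u):
    """True if the integer polynomial v divides u exactly over the integers."""
    u = list(u)
    while len(u) >= len(v):
        if u[-1] % v[-1]:
            return False                     # quotient would not be integral
        c = u[-1] // v[-1]
        k = len(u) - len(v)
        for i, y in enumerate(v):
            u[k + i] -= c * y
        while u and u[-1] == 0:
            u.pop()
    return not u                             # zero remainder means divisibility

def find_factor(u, mod_factors, m):
    """Try combinations of d <= r/2 modular factors; return a true factor or None."""
    lead, r = u[-1], len(mod_factors)
    target = [lead * c for c in u]           # l(u) * u(x)
    for d in range(1, r // 2 + 1):
        for combo in combinations(mod_factors, d):
            v = [lead % m]
            for f in combo:
                v = mul_mod(v, f, m)
            vbar = [symmetric(c, m) for c in v]
            if int_divides(vbar, target):
                g = reduce(gcd, vbar)
                return [c // g for c in vbar]    # primitive part of vbar
    return None

u22 = [-5, 2, 8, -3, -3, 0, 1, 0, 1]         # polynomial (22)
lifted = [[-36, 1], [-66, 82, -18, 1], [84, 69, -10, 54, 1]]   # factors mod 169
no_factor = find_factor(u22, lifted, 169)

# 2x^2 + 3x + 1 = (2x + 1)(x + 1); modulo 25 its monic factors are x + 13, x + 1.
found = find_factor([1, 3, 2], [[13, 1], [1, 1]], 25)
```

In the second call the candidate $2(x + 13)$ reduces to $2x + 1$ in the symmetric range, which is why the leading coefficient is multiplied in before the divisibility test.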

The procedure as stated requires $p^e > 2B$, where B is a bound on the
coefficients of any divisor of $u_nu(x)$, but we can use a much smaller value of B if
we only guarantee it to be valid for divisors of degree ≤ ½deg(u). In this case
the divisibility test in step F2 should be applied to $w(x) = v_1(x)\ldots v_r(x)/v(x)$
instead of v(x), whenever deg(v) > ½deg(u).

The above algorithm contains an obvious bottleneck: We may have to test
as many as $2^{r-1}$ potential factors v(x). The average value of $2^r$ in a random
situation is about n, or perhaps $n^{1.5}$ (see exercise 5), but in nonrandom situations
we will want to speed up this part of the routine as much as we can. One
way to rule out spurious factors quickly is to compute the trailing coefficient
$\bar v(0)$ first, continuing only if this divides $\ell(u)u(0)$; the complication explained in
the preceding paragraph does not have to be considered unless this divisibility
condition is satisfied, since such a test is valid even when deg(v) > ½deg(u).

Another important way to speed up the procedure is to reduce r so that
it tends to reflect the true number of factors. The distinct degree factorization
algorithm above can be applied for various small primes $p_j$, thus obtaining for
each prime a set $D_j$ of possible degrees of factors modulo $p_j$; see exercise 26. We
can represent $D_j$ as a string of n binary bits. Now we compute the intersection
$\bigcap D_j$, namely the logical “and” of these bit strings, and we perform step F2 only
for $\deg(v_{i_1}) + \cdots + \deg(v_{i_d}) \in \bigcap D_j$. Furthermore p is chosen to be that $p_j$ having the
smallest value of r. This technique is due to David R. Musser, whose experience
suggests trying about five primes $p_j$ (see JACM 25 (1978), 271-282). Of course
we would stop immediately if the current $\bigcap D_j$ shows that u(x) is irreducible.
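
The bit-string representation of the degree sets is easy to try out. In the sketch below, which uses hypothetical names and Python integers as bit strings, bit k is set when some combination of the irreducible factors modulo a prime has total degree k. Bits 0 through n are kept, so the strings are one bit longer than the n bits mentioned in the text. The degrees are those reported above for polynomial (22): 1, 3, 4 modulo 13 and 6, 2 modulo 2.

```python
def degree_set(degrees):
    """Bit k of the result is 1 iff some sub-collection of `degrees` sums to k."""
    bits = 1                      # the empty combination has degree 0
    for d in degrees:
        bits |= bits << d         # either skip this factor or include it
    return bits

n = 8
D13 = degree_set([1, 3, 4])       # degrees of the factors of (22) modulo 13
D2 = degree_set([6, 2])           # degrees of the factors of (22) modulo 2
common = D13 & D2                 # logical "and" of the two bit strings
possible = [k for k in range(n + 1) if (common >> k) & 1]
```

Only the trivial degrees 0 and n survive the intersection here, which is the bit-string form of the earlier observation that (22) must be irreducible over the integers.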

Musser has given a complete discussion of a factorization method similar to 
the steps above, in JACM 22 (1975), 291-308. The procedure above incorporates 
an improvement suggested in 1978 by G. E. Collins, namely to look for trial 
divisors by taking combinations of d factors at a time rather than combinations of 
total degree d. This improvement is important because of the statistical behavior 
of the modulo-p factors of polynomials that are irreducible over the rationals (cf. 
exercise 37). 

Greatest common divisors. Similar techniques can be used to calculate greatest
common divisors of polynomials: If gcd(u(x), v(x)) = d(x) over the integers, and
if gcd(u(x), v(x)) = q(x) (modulo p) where q(x) is monic, then d(x) is a common
divisor of u(x) and v(x) modulo p; hence

$$d(x) \text{ divides } q(x) \pmod{p}. \tag{24}$$

If p does not divide the leading coefficients of both u and v, it does not divide the
leading coefficient of d; in such a case deg(d) ≤ deg(q). When q(x) = 1 for such
a prime p, we must therefore have deg(d) = 0, and d(x) = gcd(cont(u), cont(v)).
This justifies the remark made in Section 4.6.1 that the simple computation
of gcd(u(x), v(x)) modulo 13 in 4.6.1-6 is enough to prove that u(x) and v(x)
are relatively prime over the integers; the comparatively laborious calculations
of Algorithm 4.6.1E or Algorithm 4.6.1C are unnecessary. Since two random
primitive polynomials are almost always relatively prime over the integers, and
since they are relatively prime modulo p with probability 1 - 1/p, it is usually
a good idea to do the computations modulo p.
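
The modulo-13 computation referred to here takes only a few lines. The sketch below is an illustration with hypothetical helper names; the second polynomial, $3x^6 + 5x^4 - 4x^2 - 9x + 21$, is not shown in this section and is assumed to be the companion of (22) used in Section 4.6.1.

```python
def trim(a):
    while a and a[-1] == 0:
        a.pop()
    return a

def pmod(a, b, p):
    """Remainder of a(x) on division by b(x), coefficients modulo the prime p."""
    a, b = trim([c % p for c in a]), trim([c % p for c in b])
    inv = pow(b[-1], -1, p)
    while len(a) >= len(b):
        c = a[-1] * inv % p
        k = len(a) - len(b)
        for i, y in enumerate(b):
            a[k + i] = (a[k + i] - c * y) % p
        trim(a)
    return a

def pgcd(a, b, p):
    """Monic gcd modulo p by Euclid's algorithm (a and b not both zero)."""
    a, b = trim([c % p for c in a]), trim([c % p for c in b])
    while b:
        a, b = b, pmod(a, b, p)
    inv = pow(a[-1], -1, p)
    return [c * inv % p for c in a]

u = [-5, 2, 8, -3, -3, 0, 1, 0, 1]     # x^8 + x^6 - 3x^4 - 3x^3 + 8x^2 + 2x - 5
v = [21, -9, -4, 0, 5, 0, 3]           # assumed companion polynomial from 4.6.1
q = pgcd(u, v, 13)                     # q(x) = 1 means deg(d) = 0 by (24)
```

Since 13 divides neither leading coefficient, a result of `[1]` is enough to conclude that the two polynomials have no common factor of positive degree over the integers.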

As remarked before, we need good methods also for the nonrandom polynomials
that arise in practice. Therefore we wish to sharpen our techniques and
discover how to find gcd(u(x), v(x)) in general, over the integers, based entirely
on information that we obtain working modulo primes p. We may assume that
u(x) and v(x) are primitive.

Instead of calculating gcd(u(x), v(x)) directly, it will be convenient to search
instead for the polynomial

$$d(x) = c\cdot\gcd\bigl(u(x), v(x)\bigr), \tag{25}$$

where the constant c is chosen so that

$$\ell(d) = \gcd\bigl(\ell(u), \ell(v)\bigr). \tag{26}$$

This condition will always hold for suitable c, since the leading coefficient of any
common divisor of u(x) and v(x) must be a divisor of $\gcd(\ell(u), \ell(v))$. Once d(x)
has been found satisfying these conditions, we can readily compute pp(d(x)),
which is the true greatest common divisor of u(x) and v(x). Condition (26)
is convenient since it avoids the uncertainty of unit multiples of the gcd; we
have used essentially the same idea to control the leading coefficients in our
factorization routine.

If p is a sufficiently large prime, based on the bounds for coefficients in
exercise 20 applied either to $\ell(d)u(x)$ or $\ell(d)v(x)$, let us compute the unique
polynomial $\bar q(x) \equiv \ell(d)q(x)$ (modulo p) having all coefficients in $[-\frac12p, \frac12p)$.
When $\mathrm{pp}(\bar q(x))$ divides both u(x) and v(x), it must equal gcd(u(x), v(x)) because
of (24). On the other hand if it does not divide both u(x) and v(x) we must
have deg(q) > deg(d). A study of Algorithm 4.6.1E reveals that this will be the
case only if p divides the leading coefficient of one of the nonzero remainders
computed by that algorithm with exact integer arithmetic; otherwise Euclid’s
algorithm modulo p deals with precisely the same sequence of polynomials as
Algorithm 4.6.1E except for nonzero constant multiples (modulo p). So only a
small number of “unlucky” primes can cause us to miss the gcd, and we will
soon find a lucky prime if we keep trying.

If the bound on coefficients is so large that single-precision primes p are
insufficient, we can compute d(x) modulo several primes p until it has been
determined via the Chinese remainder algorithm of Section 4.3.2. This approach,
which is due to W. S. Brown and G. E. Collins, has been described in detail by
Brown in JACM 18 (1971), 478-504. Alternatively, as suggested by J. Moses
and D. Y. Y. Yun [Proc. ACM Conf. 28 (1973), 159-166], we can use Hensel’s
method to determine d(x) modulo $p^e$ for sufficiently large e. Hensel’s construction
appears to be computationally superior to the Chinese remainder approach; but
it is valid directly only when

$$\gcd\bigl(d(x), u(x)/d(x)\bigr) = 1 \quad\text{or}\quad \gcd\bigl(d(x), v(x)/d(x)\bigr) = 1, \tag{27}$$

since the idea is to apply the techniques of exercise 22 to one of the factorizations
$\ell(d)u(x) \equiv \bar q(x)u_1(x)$ or $\ell(d)v(x) \equiv \bar q(x)v_1(x)$ (modulo p). Exercises 34 and 35
show that it is possible to arrange things so that (27) holds whenever necessary.

The gcd algorithms sketched here are significantly faster than those of
Section 4.6.1 except when the polynomial remainder sequence is very short.
Perhaps the best general procedure would be to start with the computation of
gcd(u(x), v(x)) modulo a fairly small prime p, not a divisor of both $\ell(u)$ and $\ell(v)$.
If the result q(x) is 1, we’re done; if it has high degree, we use Algorithm 4.6.1C;
otherwise we use one of the above methods, first computing a bound for the
coefficients of d(x) based on the coefficients of u(x) and v(x), and on the (small)
degree of q(x). As in the factorization problem, we should apply this procedure
to the reverses of u(x), v(x) and reverse the result, if the trailing coefficients are
simpler than the leading ones.

Multivariate polynomials. Similar techniques lead to useful algorithms for
factorization or gcd calculations on multivariate polynomials with integer
coefficients. It is convenient to deal with the polynomial $u(x_1, \ldots, x_t)$ by working
modulo the irreducible polynomials $x_2 - a_2$, ..., $x_t - a_t$, which play the role of p
in the above discussion. Since v(x) mod (x - a) = v(a), the value of $u(x_1, \ldots, x_t)$
modulo these polynomials is the univariate polynomial $u(x_1, a_2, \ldots, a_t)$. When the
integers $a_2$, ..., $a_t$ have
been chosen so that $u(x_1, a_2, \ldots, a_t)$ has the same degree in $x_1$ as $u(x_1, x_2, \ldots, x_t)$,
an appropriate generalization of Hensel’s construction will “lift” squarefree
factorizations of this univariate polynomial to factorizations modulo $(x_2 - a_2)^{n_2}$,
..., $(x_t - a_t)^{n_t}$, where $n_j$ is the degree of $x_j$ in u; at the same time we can
also work modulo an appropriate integer prime p. As many as possible of the
$a_j$ should be zero, so that sparseness of the intermediate results is retained. For
details, see P. S. Wang, Math. Comp. 32 (1978), 1215-1231, in addition to the
papers by Musser and by Moses and Yun cited earlier.
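
The requirement that the substitution preserve the degree in $x_1$ can be illustrated with a small bivariate example. The sketch below uses a hypothetical dictionary representation, mapping exponent pairs (i, j) to the coefficient of $x_1^i x_2^j$, and the sample polynomial $(x_2 - 1)x_1^2 + x_1 + x_2$ is invented for the illustration.

```python
def specialize(u, a):
    """Substitute x2 = a in u(x1, x2); return coefficients in x1, lowest first."""
    out = {}
    for (i, j), c in u.items():
        out[i] = out.get(i, 0) + c * a ** j
    coeffs = [out.get(i, 0) for i in range(max(out) + 1)]
    while coeffs and coeffs[-1] == 0:
        coeffs.pop()
    return coeffs

def degree_in_x1(u):
    return max(i for (i, j), c in u.items() if c)

def good_point(u, a):
    """True if u(x1, a) has the same degree in x1 as u(x1, x2)."""
    return len(specialize(u, a)) - 1 == degree_in_x1(u)

# u(x1, x2) = (x2 - 1) x1^2 + x1 + x2, a made-up example.
u = {(2, 1): 1, (2, 0): -1, (1, 0): 1, (0, 1): 1}
at_zero = specialize(u, 0)     # -x1^2 + x1
at_one = specialize(u, 1)      # x1 + 1: the leading term vanishes
```

Here the choice $a_2 = 0$ is acceptable and also keeps the image sparse, while $a_2 = 1$ would have to be rejected because the degree drops.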

EXERCISES 

► 1. [M24] Let p be prime. What is the probability that a random polynomial of
degree n has a linear factor (a factor of degree 1), when n ≥ p? Show that this
probability is more than ½. (Assume that each of the $p^n$ monic polynomials modulo p
is equally likely.) What is the average number of linear factors?

► 2. [M25] (a) Show that any monic polynomial u(x), over a unique factorization
domain, may be expressed uniquely in the form

$$u(x) = v(x)^2w(x),$$

where w(x) is squarefree (has no factor of positive degree of the form $d(x)^2$) and both
v(x) and w(x) are monic. (b) (E. R. Berlekamp.) How many monic polynomials of
degree n are squarefree modulo p, when p is prime?

3. [M25] Let $u_1(x)$, ..., $u_r(x)$ be polynomials over a field S, with $u_j(x)$ relatively
prime to $u_k(x)$ for all j ≠ k. For any given polynomials $w_1(x)$, ..., $w_r(x)$ over S, prove
that there is a unique polynomial v(x) over S such that

$$\deg(v) < \deg(u_1) + \cdots + \deg(u_r)$$

and

$$v(x) \equiv w_j(x) \pmod{u_j(x)}$$

for 1 ≤ j ≤ r. (Compare with Theorem 4.3.2C.)

4. [HM28] Let $a_{np}$ be the number of monic irreducible polynomials of degree n,
modulo a prime p. Find a formula for the generating function $G_p(z) = \sum_n a_{np}z^n$.
[Hint: Prove the following identity connecting power series: $f(z) = \sum_{j\ge1} g(z^j)/j^t$ if
and only if $g(z) = \sum_{n\ge1} \mu(n)f(z^n)/n^t$.] What is $\lim_{p\to\infty} a_{np}/p^n$?

5. [HM30] Let $A_{np}$ be the average number of factors of a randomly selected
polynomial of degree n, modulo a prime p. Show that $\lim_{p\to\infty} A_{np} = H_n$. What is the
limiting average value of $2^r$, when there are r factors?

6. [M21] (J. L. Lagrange, 1771.) Prove the congruence (9). [Hint: Factor $x^p - x$
in the field of p elements.]

7. [M22] Prove Eq. (14). 

8. [HM20] How can we be sure that the vectors output by Algorithm N are linearly 
independent? 

9. [20] Explain how to construct a table of reciprocals mod 101 in a simple way,
given that 2 is a primitive root of 101.

► 10. [21] Find the complete factorization of the polynomial u(x) in (22), modulo 2,
using Berlekamp’s procedure.

11. [22] Find the complete factorization of the polynomial u(x) in (22), modulo 5. 

► 12. [M22] Use Berlekamp’s algorithm to determine the number of factors of u(x) =
$x^4 + 1$, modulo p, for all primes p. [Hint: Consider the cases p = 2, p = 8k + 1,
p = 8k + 3, p = 8k + 5, p = 8k + 7 separately; what is the matrix Q? You need not
discover the factors; just determine how many there are.]

13. [M25] Give an explicit formula for the factors of $x^4 + 1$, modulo p, for all
odd primes p, in terms of the quantities $\sqrt{-1}$, $\sqrt{2}$, $\sqrt{-2}$ (if such square roots exist
modulo p).

14. [M25] (H. Zassenhaus.) Let v(x) be a solution to (8), and let $w(x) = \prod(x - s)$
where the product is over all 0 ≤ s < p such that gcd(u(x), v(x) - s) ≠ 1. Explain
how to compute w(x), given u(x) and v(x). [Hint: Eq. (14) implies that w(x) is the
polynomial of least degree such that u(x) divides w(v(x)).]

► 15. [M27] Design an algorithm to calculate the “square root” of a given integer u
modulo a given prime p, i.e., to find an integer v such that $v^2 \equiv u$ (modulo p) whenever
such a v exists. Your algorithm should be efficient even for very large primes p. (Note
that a solution to this problem leads to a procedure for solving any given quadratic
equation modulo p, using the quadratic formula in the usual way.)

16. [M30] (a) Given that f(x) is an irreducible polynomial modulo a prime p, of
degree n, prove that the $p^n$ polynomials of degree less than n form a field under
arithmetic modulo f(x) and p. (Note: The existence of irreducible polynomials of each
degree is proved in exercise 4; therefore fields with $p^n$ elements exist for all primes p and
all n ≥ 1.) (b) Show that any field with $p^n$ elements has a “primitive root” element ξ
such that the elements of the field are $\{0, 1, \xi, \xi^2, \ldots, \xi^{p^n-2}\}$. [Hint: Exercise 3.2.1.2-16
provides a proof in the special case n = 1.] (c) If f(x) is an irreducible polynomial
modulo p, of degree n, prove that $x^{p^m} - x$ is divisible by f(x) if and only if m is a
multiple of n. (It follows that we can test irreducibility rather quickly: A given nth
degree polynomial f(x) is irreducible modulo p if and only if $x^{p^n} - x$ is divisible by f(x)
and $\gcd(x^{p^{n/q}} - x, f(x)) = 1$ for all primes q dividing n.)

17. [M23] Let F be a field with $13^2$ elements. How many elements of F have order f,
for each integer f with 1 ≤ f < $13^2$? (The “order” of an element a is the least positive
integer m such that $a^m = 1$.)

► 18. [M25] Let $u(x) = u_nx^n + \cdots + u_0$, $u_n \ne 0$, be a primitive polynomial with
integer coefficients, and let v(x) be the monic polynomial defined by

$$v(x) = u_n^{n-1}\cdot u(x/u_n) = x^n + u_{n-1}x^{n-1} + u_{n-2}u_nx^{n-2} + \cdots + u_0u_n^{n-1}.$$

(a) Given that v(x) has the complete factorization $p_1(x)\ldots p_r(x)$ over the integers, where
each $p_j(x)$ is monic, what is the complete factorization of u(x) over the integers? (b)
If $w(x) = x^m + w_{m-1}x^{m-1} + \cdots + w_0$ is a factor of v(x), prove that $w_k$ is a multiple
of $u_n^{m-1-k}$ for 0 ≤ k < m.

19. [M20] (Eisenstein’s criterion.) Perhaps the best-known class of irreducible
polynomials over the integers was introduced by G. Eisenstein in J. für die reine und angew.
Math. 39 (1850), 166-167: Let p be prime and let $u(x) = u_nx^n + \cdots + u_0$ have the
following properties: (i) $u_n$ is not divisible by p; (ii) $u_{n-1}$, ..., $u_0$ are divisible by p;
(iii) $u_0$ is not divisible by $p^2$. Show that u(x) is irreducible over the integers.

20. [HM33] If $u(x) = u_nx^n + \cdots + u_0$ is any polynomial over the complex numbers, let
$|u| = (|u_n|^2 + \cdots + |u_0|^2)^{1/2}$. (a) Let $g(x) = (x - \alpha)u(x)$ and $h(x) = (\bar\alpha x - 1)u(x)$, where
α is any complex number and $\bar\alpha$ is its complex conjugate. Prove that |g| = |h|. (b) Let
the complete factorization of u(x) over the complex numbers be $u_n(x - \alpha_1)\ldots(x - \alpha_n)$,
and write $M(u) = \prod_{1\le j\le n}\max(1, |\alpha_j|)$. Prove that $M(u) \le |u|/|u_n|$. (c) Show that
$|u_j| \le |u_n|\bigl(\binom{n-1}{j}M(u) + \binom{n-1}{j-1}\bigr)$ for 0 ≤ j ≤ n. (d) Combine these results to prove
that if u(x) = v(x)w(x) and $v(x) = v_mx^m + \cdots + v_0$, where u, v, w all have integer
coefficients, then the coefficients of v are bounded by

$$|v_j| \le \binom{m-1}{j}|u| + \binom{m-1}{j-1}|u_n|.$$


21. [HM30] The purpose of this exercise is to obtain useful bounds on the coefficients
of multivariate polynomial factors over the integers. Given a polynomial $u(x_1, \ldots, x_t)$
over the complex numbers, let |u| be $(\sum |u_{j_1\ldots j_t}|^2)^{1/2}$ summed over all the coefficients.
Let $e(x) = e^{2\pi ix}$. (a) Prove that

$$|u|^2 = \int_0^1\!\!\cdots\!\int_0^1 \bigl|u\bigl(e(\theta_1), \ldots, e(\theta_t)\bigr)\bigr|^2\,d\theta_t\ldots d\theta_1.$$

(b) Let u(x) = v(x)w(x), where deg(v) = m and deg(w) = k. Use the results proved
in exercise 20 to show that we always have $|v||w| \le f(m,k)^{1/2}|u|$, where $f(m,k) =
\binom{2m}{m}\binom{2k}{k}$. (c) Let $u(x_1, \ldots, x_t) = v(x_1, \ldots, x_t)w(x_1, \ldots, x_t)$, where v and w have the
respective degrees $m_j$ and $k_j$ in $x_j$. Prove that

$$|v||w| \le \bigl(f(m_1, k_1)\ldots f(m_t, k_t)\bigr)^{1/2}|u|.$$


► 22. [M24] (Hensel’s Lemma.) Let u(x), $v_e(x)$, $w_e(x)$, a(x), b(x) be polynomials with
integer coefficients, satisfying the relations

$$u(x) \equiv v_e(x)w_e(x) \pmod{p^e}, \qquad a(x)v_e(x) + b(x)w_e(x) \equiv 1 \pmod{p},$$

where p is prime, e ≥ 1, $v_e(x)$ is monic, $\deg(a) < \deg(w_e)$, $\deg(b) < \deg(v_e)$, and
$\deg(u) = \deg(v_e) + \deg(w_e)$. Show how to compute polynomials $v_{e+1}(x) \equiv v_e(x)$ and
$w_{e+1}(x) \equiv w_e(x)$ (modulo $p^e$), satisfying the same conditions with e increased by 1.
Furthermore, prove that $v_{e+1}(x)$ and $w_{e+1}(x)$ are unique, modulo $p^{e+1}$.

Use your method for p = 2 to prove that (22) is irreducible over the integers,
starting with its factorization mod 2 found in exercise 10. (Note that Euclid’s extended
algorithm, exercise 4.6.1-3, will get the process started for e = 1.)

23. [HM23] Let u(x) be a squarefree polynomial with integer coefficients. Prove that
there are only finitely many primes p such that u(x) is not squarefree modulo p.

24. [M20] The text speaks only of factorization over the integers, not over the field
of rational numbers. Explain how to find the complete factorization of a polynomial
with rational coefficients, over the field of rational numbers.

25. [M25] What is the complete factorization of $x^5 + x^4 + x^2 + x + 2$ over the field
of rational numbers?

26. [20] Let $d_1$, ..., $d_r$ be the degrees of the irreducible factors of u(x) modulo p,
with proper multiplicity, so that $d_1 + \cdots + d_r = n = \deg(u)$. Explain how to compute
the set { deg(v) | u(x) ≡ v(x)w(x) (modulo p) for some v(x), w(x) } by performing O(r)
operations on binary bit strings of length n.

27. [HM30] Prove that a random primitive polynomial over the integers is “almost
always” irreducible, in some appropriate sense.

28. [M25] The distinct-degree factorization procedure is “lucky” when there is at
most one irreducible polynomial of each degree d; then $g_d(x)$ never needs to be broken
into factors. What is the probability of such a lucky circumstance, when factoring a
random polynomial of degree n, modulo p, for fixed n as p → ∞?

29. [M22] Let $g(x)$ be a product of two or more distinct irreducible polynomials of
degree $d$, modulo an odd prime $p$. Prove that $\gcd\bigl(g(x),\, t(x)^{(p^d-1)/2} - 1\bigr)$ will be a proper
factor of $g(x)$ with probability $\ge 1/2 - 1/(2p^{2d})$, for any fixed $g(x)$, when $t(x)$ is selected
at random from among the $p^{2d}$ polynomials of degree $< 2d$ modulo $p$.

30. [M25] Prove that if $q(x)$ is an irreducible polynomial of degree $d$, modulo $p$, and if
$t(x)$ is any polynomial, then the value of $\bigl(t(x) + t(x)^p + t(x)^{p^2} + \cdots + t(x)^{p^{d-1}}\bigr) \bmod q(x)$
is an integer (i.e., a polynomial of degree $\le 0$). Use this fact to design a probabilistic
algorithm for factoring a product $g_d(x)$ of degree-$d$ irreducibles, analogous to (21), for
the case $p = 2$.

31. [HM30] Let $p$ be an odd prime and let $d \ge 1$. Show that there exists a number
$n(p,d)$ having the following two properties: (a) For all integers $t$, exactly $n(p,d)$
irreducible polynomials $q(x)$ of degree $d$, modulo $p$, satisfy $(x+t)^{(p^d-1)/2} \bmod q(x) = 1$.
(b) For all integers $0 \le t_1 < t_2 < p$, exactly $n(p,d)$ irreducible polynomials $q(x)$ of
degree $d$, modulo $p$, satisfy $(x+t_1)^{(p^d-1)/2} \bmod q(x) = (x+t_2)^{(p^d-1)/2} \bmod q(x)$.

► 32. [M30] (Cyclotomic polynomials.) Let $\Psi_n(x) = \prod_{1\le k\le n,\ \gcd(k,n)=1}(x - \omega^k)$,
where $\omega = e^{2\pi i/n}$; thus, the roots of $\Psi_n(x)$ are the complex $n$th roots of unity that
aren't $m$th roots for $m < n$. (a) Prove that $\Psi_n(x)$ is a polynomial with integer coefficients,
and that

$$x^n - 1 = \prod_{d\backslash n}\Psi_d(x); \qquad \Psi_n(x) = \prod_{d\backslash n}(x^d - 1)^{\mu(n/d)}.$$

(Cf. exercises 4.5.2-10(b) and 4.5.3-28(c).) (b) Prove that $\Psi_n(x)$ is irreducible over
the integers, hence the above formula is the complete factorization of $x^n - 1$ over the
integers. [Hint: If $f(x)$ is an irreducible factor of $\Psi_n(x)$ over the integers, and if $\zeta$ is a
complex number with $f(\zeta) = 0$, prove that $f(\zeta^p) = 0$ for all primes $p$ not dividing $n$.
It may help to use the fact that $x^n - 1$ is squarefree modulo $p$ for all such primes.]
(c) Discuss the calculation of $\Psi_n(x)$, and tabulate the values for $n \le 15$.

33. [M18] True or false: If $u(x) \ne 0$ and the complete factorization of $u(x)$ modulo $p$
is $p_1(x)^{e_1} \ldots p_r(x)^{e_r}$, then $u(x)/\gcd(u(x), u'(x)) = p_1(x) \ldots p_r(x)$.

► 34. [M25] (Squarefree factorization.) It is clear that any primitive polynomial of a
unique factorization domain can be expressed in the form $u(x) = u_1(x)\,u_2(x)^2\,u_3(x)^3 \ldots$,
where the polynomials $u_i(x)$ are squarefree and relatively prime to each other. This
representation, in which $u_j(x)$ is the product of all the irreducible polynomials that
divide $u(x)$ exactly $j$ times, is unique except for unit multiples; and it is a useful way to
represent polynomials that participate in multiplication, division, and gcd operations.

Let GCD$(u(x), v(x))$ be a procedure that returns three answers:

$$\mathrm{GCD}(u(x), v(x)) = \bigl(d(x),\ u(x)/d(x),\ v(x)/d(x)\bigr), \quad\text{where } d(x) = \gcd(u(x), v(x)).$$

The modular method described in the text following Eq. (25) always ends with a trial
division of $u(x)/d(x)$ and $v(x)/d(x)$, to make sure that no “unlucky prime” has been
used, so the quantities $u(x)/d(x)$ and $v(x)/d(x)$ are byproducts of the gcd computation;
thus we can compute GCD$(u(x), v(x))$ essentially as fast as $\gcd(u(x), v(x))$ when we are
using a modular method.

Devise a procedure that obtains the squarefree representation $(u_1(x), u_2(x), \ldots)$ of
a given primitive polynomial $u(x)$ over the integers. Your algorithm should perform
exactly $e$ computations of a GCD, where $e$ is the largest subscript with $u_e(x) \ne 1$;
furthermore, each GCD calculation should satisfy (27), so that Hensel's construction
can be used.

35. [M22] (D. Y. Y. Yun.) Design an algorithm that computes the squarefree representation
$(w_1(x), w_2(x), \ldots)$ of $w(x) = \gcd(u(x), v(x))$ over the integers, given the
squarefree representations $(u_1(x), u_2(x), \ldots)$ and $(v_1(x), v_2(x), \ldots)$ of $u(x)$ and $v(x)$.

36. [M27] Extend the procedure of exercise 34 so that it will obtain the squarefree
representation $(u_1(x), u_2(x), \ldots)$ of a given polynomial $u(x)$ when the coefficient arithmetic
is performed modulo $p$.

37. [HM24] (George E. Collins.) Let $d_1, \ldots, d_r$ be positive integers whose sum is $n$,
and let $p$ be prime. What is the probability that the irreducible factors of a random $n$th-degree
integer polynomial $u(x)$ have degrees $d_1, \ldots, d_r$, when it is completely factored
modulo $p$? Show that this probability is asymptotically the same as the probability
that a random permutation on $n$ elements has cycles of lengths $d_1, \ldots, d_r$.

38. [M48] (V. R. Pratt.) If possible, find a way to construct proofs of irreducibility
for all polynomials that are irreducible over the integers, so that the length of proof is
at most a polynomial in $\deg(u)$ and the length of its coefficients. (Only a bound on
the length of proof is requested here, as in exercise 4.5.4-17, not a bound on the time
needed to find such a proof.)


4.6.3. Evaluation of Powers 

In this section we shall study the interesting problem of computing $x^n$
efficiently, given $x$ and $n$, where $n$ is a positive integer. Suppose, for example,
that we need to compute $x^{16}$; we could simply start with $x$ and multiply by $x$
fifteen times. But it is possible to obtain the same answer with only four multiplications,
if we repeatedly take the square of each partial result, successively
forming $x^2$, $x^4$, $x^8$, $x^{16}$.

The same idea applies, in general, to any value of $n$, in the following way:
Write $n$ in the binary number system (suppressing zeros at the left). Then
replace each “1” by the pair of letters SX, replace each “0” by S, and cross off
the “SX” that now appears at the left. The result is a rule for computing $x^n$, if
“S” is interpreted as the operation of squaring, and if “X” is interpreted as the
operation of multiplying by $x$. For example, if $n = 23$, its binary representation
is 10111; so we form the sequence SX S SX SX SX and remove the leading SX
to obtain the rule SSXSXSX. This rule states that we should “square, square,
multiply by $x$, square, multiply by $x$, square, and multiply by $x$”; in other words,
we should successively compute $x^2$, $x^4$, $x^5$, $x^{10}$, $x^{11}$, $x^{22}$, $x^{23}$.
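The S-and-X rule can be sketched in a few lines of Python. This is only an illustration of the rule as stated above; the function names `sx_rule` and `power_left_to_right` are hypothetical and do not come from the text.

```python
def sx_rule(n):
    """Return the S/X rule for a positive integer n, e.g. 23 -> 'SSXSXSX'."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    bits = bin(n)[2:]                      # binary digits, no leading zeros
    rule = "".join("SX" if b == "1" else "S" for b in bits)
    return rule[2:]                        # cross off the leading "SX"


def power_left_to_right(x, n):
    """Evaluate x**n by following the S/X rule from left to right."""
    result = x
    for op in sx_rule(n):
        if op == "S":
            result = result * result       # S: square the partial result
        else:
            result = result * x            # X: multiply by x
    return result
```

For $n = 23$ the rule has seven letters, so seven multiplications are performed, in agreement with the sequence of exponents 2, 4, 5, 10, 11, 22, 23 listed above.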

This “binary method” is easily justified by a consideration of the sequence of 
exponents in the calculation: If we reinterpret “S” as the operation of multiplying 
by 2 and “X” as the operation of adding 1, and if we start with 1 instead of x,
the rule will lead to a computation of n because of the properties of the binary 
number system. The method is quite ancient; it appeared before 200 B.C. in 
Pingala’s Hindu classic Chandah-sutra [see B. Datta and A. N. Singh, History 
of Hindu Mathematics 1 (Bombay: 1935), 76]; however, there seem to be no 
other references to this method outside of India during the next 1000 years. 
A clear discussion of how to compute $2^n$ efficiently for arbitrary $n$ was given
by al-Uqlidisi of Damascus in 952 A.D.; see The Arithmetic of al-Uqlidisi by 
A. S. Saidan (Dordrecht: D. Reidel, 1975), 341-342, where the general ideas 
are illustrated for n = 51. See also al-Biruni’s Chronology of Ancient Nations, 
ed. and translated by E. Sachau (London: 1879), 132-136; this eleventh-century 
Arabic work had great influence. 

Fig. 12. Evaluation of $x^n$, based on a right-to-left scan of the binary notation for $n$.

The S-and-X binary method for obtaining $x^n$ requires no temporary storage
except for $x$ and the current partial result, so it is well suited for incorporation
in the hardware of a binary computer. The method can also be readily
programmed for either binary or decimal computers; but it requires that the
binary representation of $n$ be scanned from left to right, while it is usually more
convenient to do this from right to left. For example, with a binary computer
we can shift the binary value of $n$ to the right one bit at a time until zero is
reached; with a decimal computer we can divide by 2 (or, equivalently, multiply
by 5 or $\frac12$) to deduce the binary representation from right to left. Therefore the
following algorithm, based on a right-to-left scan of the number, is often more
convenient:

Algorithm A (Right-to-left binary method for exponentiation). This algorithm
evaluates $x^n$, where $n$ is a positive integer. (Here $x$ belongs to any algebraic
system in which an associative multiplication, with identity element 1, has been
defined.)

A1. [Initialize.] Set $N \leftarrow n$, $Y \leftarrow 1$, $Z \leftarrow x$.

A2. [Halve N.] (At this point, $x^n = Y \cdot Z^N$.) Set $N \leftarrow \lfloor N/2 \rfloor$, and at the same
time determine whether $N$ was even or odd. If $N$ was even, skip to step A5.

A3. [Multiply Y by Z.] Set $Y \leftarrow Z$ times $Y$.

A4. [N = 0?] If $N = 0$, the algorithm terminates, with $Y$ as the answer.

A5. [Square Z.] Set $Z \leftarrow Z$ times $Z$, and return to step A2. ∎
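Steps A1 through A5 translate almost line for line into Python. The sketch below is an illustration, not the MIX program of exercise 2; the function name and the `one` parameter (the identity element of whatever system $x$ belongs to) are assumptions made here.

```python
def power_right_to_left(x, n, one=1):
    """Algorithm A: evaluate x**n by scanning the bits of n right to left."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    N, Y, Z = n, one, x                 # A1. Initialize.
    while True:
        N, was_odd = N // 2, N % 2      # A2. Halve N, remembering its parity.
        if was_odd:
            Y = Z * Y                   # A3. Multiply Y by Z.
            if N == 0:                  # A4. N = 0?
                return Y
        Z = Z * Z                       # A5. Square Z, then back to A2.
```

Only `*` is applied to `x`, so the same function works for integers, fractions, or any user-defined type with an associative multiplication, provided a suitable `one` is supplied.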

As an example of Algorithm A, consider the steps in the evaluation of $x^{23}$:

                     N     Y          Z
    After step A1    23    1          $x$
    After step A4    11    $x$        $x$
    After step A4    5     $x^3$      $x^2$
    After step A4    2     $x^7$      $x^4$
    After step A4    0     $x^{23}$   $x^{16}$

A MIX program corresponding to Algorithm A appears in exercise 2.

The great calculator al-Kashi stated Algorithm A about 1414 A.D. [Istoriko-Mat.
Issledovaniia 7 (1954), 256-257]. The method is closely related to a procedure
for multiplication that was actually used by Egyptian mathematicians as
early as 1800 B.C.; for if we change step A3 to “$Y \leftarrow Y + Z$” and step A5 to
“$Z \leftarrow Z + Z$”, and if we set $Y$ to zero instead of unity in step A1, the algorithm
terminates with $Y = nx$. This is a practical method for multiplication by hand,
since it involves only the simple operations of doubling, halving, and adding. It
is often called the “Russian peasant method” of multiplication, since Western
visitors to Russia in the nineteenth century found the method in wide use there.
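The modification just described (addition in place of multiplication, zero in place of unity) can be sketched as follows; the function name is hypothetical.

```python
def peasant_multiply(n, x):
    """Compute n*x for a positive integer n by doubling, halving, and adding."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    N, Y, Z = n, 0, x                   # A1, with Y set to zero instead of unity
    while True:
        N, was_odd = N // 2, N % 2      # A2. Halve N.
        if was_odd:
            Y = Y + Z                   # A3 becomes Y <- Y + Z.
            if N == 0:                  # A4.
                return Y
        Z = Z + Z                       # A5 becomes Z <- Z + Z.
```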

The number of multiplications required by Algorithm A is $\lfloor \lg n \rfloor + \nu(n)$,
where $\nu(n)$ is the number of ones in the binary representation of $n$. This is
one more multiplication than the left-to-right binary method mentioned at the
beginning of this section would require, due to the fact that the first execution
of step A3 is simply a multiplication by unity.

Because of the bookkeeping time required by this algorithm, the binary
method is usually not of importance for small values of $n$, say $n < 10$, unless
the time for a multiplication is comparatively large. If the value of $n$ is known in
advance, the left-to-right binary method is preferable. In some situations, such
as the calculation of $x^n \bmod u(x)$ discussed in Section 4.6.2, it is much easier
to multiply by $x$ than to perform a general multiplication or to square a value,
so binary methods for exponentiation are primarily suited for quite large $n$ in
such cases. If we wish to calculate the exact multiple-precision value of $x^n$,
when $x$ is an integer $> 1$, binary methods are no help unless $n$ is so huge
that the high-speed multiplication routines of Section 4.3.3 are involved; and
such applications are rare. Similarly, binary methods are usually inappropriate
for raising a polynomial to a power; see R. J. Fateman, SIAM J. Computing
3 (1974), 196-213, for a discussion of the extensive literature on polynomial
exponentiation. The point of these remarks is that binary methods are nice,
but not a panacea. They are most applicable when the time to multiply $x^j \cdot x^k$
is essentially independent of $j$ and $k$ (e.g., for floating point multiplication, or
multiplication mod $m$); in such cases the running time is reduced from order $n$
to order $\log n$.

Fewer multiplications. Several authors have published statements (without proof)
that the binary method actually gives the minimum possible number of multiplications.
But this is not true. The smallest counterexample is $n = 15$, when the
binary method needs six multiplications, yet we can calculate $y = x^3$ in two multiplications
and $x^{15} = y^5$ in three more, achieving the desired result with only
five multiplications. Let us now discuss some other procedures for evaluating $x^n$,
for applications when $n$ is known in advance (e.g., within an optimizing compiler).

The factor method is based on a factorization of $n$. If $n = pq$, where $p$ is the
smallest prime factor of $n$ and $q > 1$, we may calculate $x^n$ by first calculating $x^p$
and then raising this quantity to the $q$th power. If $n$ is prime, we may calculate
$x^{n-1}$ and multiply by $x$. And, of course, if $n = 1$, we have $x^n$ with no calculation
at all. Repeated application of these rules gives a procedure for evaluating $x^n$,
given any value of $n$. For example, if we want to calculate $x^{55}$, we first evaluate
$y = x^5 = x^4 x = (x^2)^2 x$; then we form $y^{11} = y^{10} y = (y^2)^5 y$. The whole process
takes eight multiplications, while the binary method would have required nine.
The factor method is better than the binary method on the average, but there
are cases ($n = 33$ is the smallest example) where the binary method excels.
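The multiplication counts quoted for the factor method follow directly from its three rules. The sketch below counts multiplications only (it does not compute powers); all function names are hypothetical, and the binary count assumes the left-to-right method, which uses one multiplication per letter of the S-and-X rule.

```python
from functools import lru_cache


def smallest_prime_factor(n):
    """Smallest prime factor of n >= 2, by trial division."""
    p = 2
    while p * p <= n:
        if n % p == 0:
            return p
        p += 1
    return n


@lru_cache(maxsize=None)
def factor_method_mults(n):
    """Multiplications used by the factor method to form x**n."""
    if n == 1:
        return 0                                   # x itself, no work
    p = smallest_prime_factor(n)
    if p == n:                                     # n prime: x**(n-1) times x
        return factor_method_mults(n - 1) + 1
    return factor_method_mults(p) + factor_method_mults(n // p)


def binary_method_mults(n):
    """Multiplications used by the left-to-right binary method."""
    return (n.bit_length() - 1) + bin(n).count("1") - 1
```

With these definitions, `factor_method_mults(55)` is 8 and `binary_method_mults(55)` is 9, while for $n = 33$ the counts are 7 and 6 respectively.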

The binary method can be generalized to an $m$-ary method as follows: Let
$n = d_0 m^t + d_1 m^{t-1} + \cdots + d_t$, where $0 \le d_j < m$ for $0 \le j \le t$. The
computation begins by forming $x$, $x^2$, $x^3$, ..., $x^{m-1}$. (Actually, only those
powers $x^{d_j}$ such that $d_j$ appears in the representation of $n$ are needed, and this
observation often saves some of the work.) Then raise $x^{d_0}$ to the $m$th power
and multiply by $x^{d_1}$; we have computed $y_1 = x^{d_0 m + d_1}$. Next, raise $y_1$ to the
$m$th power and multiply by $x^{d_2}$, obtaining $y_2 = x^{d_0 m^2 + d_1 m + d_2}$. The process
continues in this way until $y_t = x^n$ has been computed. Whenever $d_j = 0$, it
is, of course, unnecessary to multiply by $x^{d_j}$. Note that this method reduces to
the left-to-right binary method discussed earlier, when $m = 2$; there is also a
less obvious right-to-left $m$-ary method that takes more memory but only a few
more steps (see exercise 9). If $m$ is a small prime, the $m$-ary method will be
particularly efficient for calculating powers of one polynomial modulo another,
when the coefficients are treated modulo $m$ (see Eq. 4.6.2-5).
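A minimal sketch of the left-to-right $m$-ary method follows. The names `digits_base_m` and `power_m_ary` are hypothetical, the whole table $x, x^2, \ldots, x^{m-1}$ is built even though only the digits that occur are needed, and the inner "raise to the $m$th power" step simply uses Python's `**` operator as a stand-in for whatever method is preferred there.

```python
def digits_base_m(n, m):
    """Digits d_0, d_1, ..., d_t of n in radix m, most significant first."""
    digits = []
    while n:
        n, d = divmod(n, m)
        digits.append(d)
    return digits[::-1]


def power_m_ary(x, n, m):
    """Evaluate x**n by the left-to-right m-ary method."""
    if n < 1 or m < 2:
        raise ValueError("need n >= 1 and m >= 2")
    table = [None, x]                    # table[d] == x**d for 1 <= d < m
    for _ in range(2, m):
        table.append(table[-1] * x)
    digits = digits_base_m(n, m)
    y = table[digits[0]]                 # leading digit d_0 is nonzero
    for d in digits[1:]:
        y = y ** m                       # raise to the m-th power
        if d:                            # no multiplication when d_j = 0
            y = y * table[d]
    return y
```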

A systematic method that gives the minimum number of multiplications 
for all of the relatively small values of n (in particular, for most n that occur in 
practical applications) is indicated in Fig. 13. To calculate x n , find n in this tree, 
and the path from the root to n indicates the sequence of exponents that occur 
in an efficient evaluation of x n . The rule for generating this “power tree” appears 
in exercise 5. Computer tests have shown that the power tree gives optimum 
results for all of the n listed in the figure. But for large enough values of n the 
power tree method is not always an optimum procedure; the smallest examples 
are n = 77, 154, 233. The first case for which the power tree is superior to both 
the binary method and the factor method is n = 23. The first case for which 
the factor method beats the power tree method is n = 19879 = 103 • 193; such 
cases are quite rare. (For n ≤ 100,000 the power tree method is better than the
factor method 88,803 times; it ties 11,191 times; and it loses only 6 times.)
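The text defers the generating rule for the power tree to exercise 5. The sketch below assumes the usual statement of that rule: to build level $k+1$, visit the nodes $n$ of level $k$ from left to right and attach below each one the values $n + a$, for $a$ running along the path from the root to $n$ in order, discarding any value already in the tree. Both the rule as stated here and the function names are assumptions for illustration.

```python
def path_to(parent, n):
    """Path from the root 1 down to node n, using the parent links."""
    path = []
    while n:
        path.append(n)
        n = parent[n]
    return path[::-1]


def build_power_tree(num_levels):
    """Return (parent, levels) after adding num_levels levels below the root."""
    parent = {1: 0}                       # 0 marks "no parent"
    levels = [[1]]
    for _ in range(num_levels):
        new_level = []
        for node in levels[-1]:
            for a in path_to(parent, node):   # 1, a_1, ..., node
                child = node + a
                if child not in parent:       # discard values already present
                    parent[child] = node
                    new_level.append(child)
        levels.append(new_level)
    return parent, levels
```

Under this rule the path to 23 is 1, 2, 3, 5, 10, 13, 23, which uses six multiplications where the binary and factor methods each use seven.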

Addition chains. The most economical way to compute x n by multiplication 
is a mathematical problem with an interesting history. We shall now examine 
it in detail, not only because it is interesting in its own right, but because it 
is an excellent example of the theoretical questions that arise in the study of 
“optimum methods of computation.” 

Fig. 13. The “power tree.” [Tree diagram not reproduced; the two lowest levels that survive list, from left to right, the nodes 19 21 28 22 23 26 25 30 40 27 36 48 33 34 64 and 38 35 42 29 31 56 44 46 39 52 50 45 60 41 43 80 54 37 72 49 51 96 66 68 65 128.]

Although we are concerned with multiplication of powers of $x$, the problem
can easily be reduced to addition, since the exponents are additive. This leads
us to the following abstract formulation: An addition chain for $n$ is a sequence
of integers

$$1 = a_0,\ a_1,\ \ldots,\ a_r = n \tag{1}$$

with the property that

$$a_i = a_j + a_k, \quad\text{for some } k \le j < i, \tag{2}$$

for all $i = 1, 2, \ldots, r$. One way of looking at this definition is to consider a
simple computer that has an accumulator and is capable of the three operations
LDA, STA, and ADD; the machine begins with the number 1 in its accumulator,
and it proceeds to compute the number $n$ by adding together previous results.
Note that $a_1$ must equal 2, and $a_2$ is either 2, 3, or 4.
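Conditions (1) and (2) are easy to check mechanically. The following sketch (with a hypothetical function name) tests whether a list of integers is an addition chain in the sense just defined.

```python
def is_addition_chain(chain):
    """True if chain[0] == 1 and every later term is a sum of two earlier terms."""
    if not chain or chain[0] != 1:
        return False
    for i in range(1, len(chain)):
        earlier = chain[:i]
        # a_i = a_j + a_k with k <= j < i; the two summands may coincide.
        if not any((chain[i] - a) in earlier for a in earlier):
            return False
    return True
```

For instance, 1, 2, 3, 5, 10, 11, 22, 23 (the exponents produced by the binary method for $n = 23$) passes this test, whereas 1, 2, 5 does not.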

The shortest length, $r$, for which there exists an addition chain for $n$ is
denoted by $l(n)$. Our goal in the remainder of this section is to discover as much
as we can about this function $l(n)$. The values of $l(n)$ for small $n$ are displayed
in tree form in Fig. 14, which shows how to calculate $x^n$ with the fewest possible
multiplications for all $n \le 100$.
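For small $n$ the value $l(n)$ can be found by exhaustive search. The sketch below is a plain iterative-deepening search, not a method from the text; it relies on two facts established later in this section, namely that chains may be taken to be ascending and that each term is at most twice its predecessor. The function name is hypothetical, and the running time grows quickly with $n$.

```python
def shortest_chain_length(n):
    """Return l(n), the length of a shortest addition chain for n."""
    if n < 1:
        raise ValueError("n must be a positive integer")

    def search(chain, steps_left):
        last = chain[-1]
        if last == n:
            return True
        # Doubling at every remaining step could not reach n: prune.
        if steps_left == 0 or (last << steps_left) < n:
            return False
        candidates = {a + b for a in chain for b in chain if last < a + b <= n}
        for c in sorted(candidates, reverse=True):
            chain.append(c)
            found = search(chain, steps_left - 1)
            chain.pop()
            if found:
                return True
        return False

    r = 0
    while not search([1], r):      # try lengths 0, 1, 2, ... in turn
        r += 1
    return r
```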

The problem of determining $l(n)$ was apparently first raised by H. Dellac in
1894, and a partial solution by E. de Jonquières mentioned the factor method
[cf. L'Intermédiaire des Mathématiciens 1 (1894), 20, 162-164]. In his solution,
de Jonquières listed what he felt were the values of $l(p)$ for all prime numbers
$p < 200$, but his table entries for $p = 107$, 149, 163, 179 were one too high.

The factor method tells us immediately that

$$l(mn) \le l(m) + l(n), \tag{3}$$

since we can take the chains $1, a_1, \ldots, a_r = m$ and $1, b_1, \ldots, b_s = n$ and form
the chain $1, a_1, \ldots, a_r, a_r b_1, \ldots, a_r b_s = mn$.
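The construction behind (3) amounts to a one-line list operation; the function name below is hypothetical.

```python
def compose_chains(chain_m, chain_n):
    """Given chains ending in m and n, return a chain of length r + s for m*n."""
    m = chain_m[-1]
    # Append m*b_1, ..., m*b_s; the leading 1 of chain_n would only repeat m.
    return chain_m + [m * b for b in chain_n[1:]]
```

For example, composing 1, 2, 3 with 1, 2, 4, 5 gives 1, 2, 3, 6, 12, 15, the five-step chain for 15 mentioned earlier.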

Fig. 14. A tree that minimizes the number of multiplications, for $n \le 100$. [Tree diagram not reproduced.]

We can also recast the $m$-ary method into addition-chain terminology. Consider
the case $m = 2^k$, and write $n = d_0 m^t + d_1 m^{t-1} + \cdots + d_t$ in the $m$-ary
number system; the corresponding addition chain takes the form

$$\begin{gathered}
1,\ 2,\ 3,\ \ldots,\ m-2,\ m-1,\\
2d_0,\ 4d_0,\ \ldots,\ md_0,\ md_0+d_1,\\
2(md_0+d_1),\ 4(md_0+d_1),\ \ldots,\ m(md_0+d_1),\ m^2d_0+md_1+d_2,\\
\ldots,\ m^td_0+m^{t-1}d_1+\cdots+d_t.
\end{gathered} \tag{4}$$


The length of this chain is $m - 2 + (k+1)t$; and it can often be reduced by deleting
certain elements of the first row that do not occur among the coefficients $d_j$,
plus elements among $2d_0, 4d_0, \ldots$ that already appear in the first row. Whenever
digit $d_j$ is zero, the step at the right end of the corresponding line may, of course,
be dropped. Furthermore, as E. G. Thurber has observed [Duke Math. J. 40
(1973), 907-913], we can omit all the even numbers (except 2) in the first row, if
we bring values of the form $d_j/2^e$ into the computation $e$ steps earlier.
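The unreduced chain (4) can be generated as follows. This sketch (hypothetical function name) writes out every element, including the repeated ones that the reductions just described would remove, so that its length matches the formula $m - 2 + (k+1)t$ exactly.

```python
def m_ary_chain(n, k):
    """Return (chain, t) for the 2**k-ary scheme (4), without any reductions."""
    if n < 1 or k < 1:
        raise ValueError("need n >= 1 and k >= 1")
    m = 1 << k
    digits = []
    q = n
    while q:
        q, d = divmod(q, m)
        digits.append(d)
    digits.reverse()                    # d_0, d_1, ..., d_t
    t = len(digits) - 1
    chain = list(range(1, m))           # first row: 1, 2, ..., m-1
    value = digits[0]
    for d in digits[1:]:
        for _ in range(k):              # k doublings: 2v, 4v, ..., m*v
            value *= 2
            chain.append(value)
        value += d                      # right end of the row (repeats if d == 0)
        chain.append(value)
    return chain, t
```

With $n = 55$ and $k = 2$ the digits are 3, 1, 3 and the result is 1, 2, 3, 6, 12, 13, 26, 52, 55, a chain of length $4 - 2 + 3 \cdot 2 = 8$.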

The simplest case of the $m$-ary method is the binary method ($m = 2$),
when the general scheme (4) simplifies to the “S” and “X” rule mentioned at the
beginning of this section: The binary addition chain for $2n$ is the binary chain
for $n$ followed by $2n$; for $2n+1$ it is the binary chain for $2n$ followed by $2n+1$.
From the binary method we conclude that

$$l(2^{e_0} + 2^{e_1} + \cdots + 2^{e_t}) \le e_0 + t, \quad\text{if } e_0 > e_1 > \cdots > e_t \ge 0. \tag{5}$$

Let us now define two auxiliary functions for convenience in our subsequent
discussion:

$$\lambda(n) = \lfloor \lg n \rfloor; \tag{6}$$

$$\nu(n) = \text{number of 1s in the binary representation of } n. \tag{7}$$

Thus $\lambda(17) = 4$, $\nu(17) = 2$; these functions may be defined by the recurrence
relations

$$\lambda(1) = 0, \qquad \lambda(2n) = \lambda(2n+1) = \lambda(n) + 1; \tag{8}$$

$$\nu(1) = 1, \qquad \nu(2n) = \nu(n), \qquad \nu(2n+1) = \nu(n) + 1. \tag{9}$$

In terms of these functions, the binary addition chain for $n$ requires exactly
$\lambda(n) + \nu(n) - 1$ steps, and (5) becomes

$$l(n) \le \lambda(n) + \nu(n) - 1. \tag{10}$$
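Both auxiliary functions are one-liners on Python integers. The names `lam` and `nu` are chosen here for illustration (`lambda` is a reserved word in Python); the recursive versions simply restate the recurrences (8) and (9).

```python
def lam(n):
    """lambda(n) = floor(lg n), for n >= 1."""
    return n.bit_length() - 1


def nu(n):
    """nu(n) = number of 1s in the binary representation of n."""
    return bin(n).count("1")


def lam_rec(n):
    """lambda(n) via recurrence (8)."""
    return 0 if n == 1 else lam_rec(n // 2) + 1


def nu_rec(n):
    """nu(n) via recurrence (9)."""
    return 1 if n == 1 else nu_rec(n // 2) + (n % 2)


def binary_chain_length(n):
    """Number of steps in the binary addition chain, the right side of (10)."""
    return lam(n) + nu(n) - 1
```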


Special classes of chains. We may assume without any loss of generality that an
addition chain is “ascending,”

$$1 = a_0 < a_1 < a_2 < \cdots < a_r = n. \tag{11}$$

For if any two $a$'s are equal, one of them may be dropped; and we can also
rearrange the sequence (1) into ascending order and remove terms $> n$ without
destroying the addition chain property (2). From now on we shall consider only
ascending chains, without explicitly mentioning this assumption.

It is convenient at this point to define a few special terms relating to addition
chains. By definition we have, for $1 \le i \le r$,

$$a_i = a_j + a_k \tag{12}$$

for some $j$ and $k$, $0 \le k \le j < i$. Let us say that step $i$ of (11) is a doubling, if
$j = k = i - 1$; then $a_i$ has the maximum possible value $2a_{i-1}$ that can follow
the ascending chain $1, a_1, \ldots, a_{i-1}$. If $j$ (but not necessarily $k$) equals $i - 1$,
let us say that step $i$ is a star step. The importance of star steps is explained
below. Finally let us say that step $i$ is a small step if $\lambda(a_i) = \lambda(a_{i-1})$. Since
$a_{i-1} < a_i \le 2a_{i-1}$, the quantity $\lambda(a_i)$ is always equal to either $\lambda(a_{i-1})$ or
$\lambda(a_{i-1}) + 1$; it follows that, in any chain (11), the length $r$ is equal to $\lambda(n)$ plus
the number of small steps.
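The three kinds of steps can be recognized directly from the definitions. The following sketch, with a hypothetical function name and output format, labels every step of an ascending chain.

```python
def classify_steps(chain):
    """Label steps 1..r of an ascending addition chain.

    Returns one dict per step with boolean entries 'doubling', 'star', 'small'.
    """
    labels = []
    for i in range(1, len(chain)):
        a, prev = chain[i], chain[i - 1]
        labels.append({
            "doubling": a == 2 * prev,                      # j = k = i - 1
            "star": (a - prev) in chain[:i],                # j = i - 1
            "small": a.bit_length() == prev.bit_length(),   # lambda unchanged
        })
    return labels
```

For the chain 1, 2, 3, 5, 10, 13, 23 the small steps are those producing 3 and 13, and indeed $r = 6 = \lambda(23) + 2$.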

Several elementary relations hold between these types of steps: Step 1 is
always a doubling. A doubling obviously is a star step, but never a small step.
A doubling must be followed by a star step. Furthermore if step $i$ is not a small
step, then step $i + 1$ is either a small step or a star step, or both; putting this
another way, if step $i + 1$ is neither small nor star, step $i$ must be small.

A star chain is an addition chain that involves only star steps. This means
that each term $a_i$ is the sum of $a_{i-1}$ and a previous $a_k$; the simple “computer”
discussed above after Eq. (2) makes use only of the two operations STA and ADD
(not LDA) in a star chain, since each new term of the sequence utilizes the preceding
result in the accumulator. Most of the addition chains we have discussed
so far are star chains. The minimum length of a star chain for $n$ is denoted by
$l^*(n)$; clearly

$$l(n) \le l^*(n). \tag{13}$$

We are now ready to derive some nontrivial facts about addition chains. First
we can show that there must be fairly many doublings if $r$ is not far from $\lambda(n)$.

Theorem A. If the addition chain (11) includes $d$ doublings and $f = r - d$
nondoublings, then

$$n \le 2^{d-1} F_{f+3}. \tag{14}$$

Proof. By induction on $r = d + f$, we see that (14) is certainly true when $r = 1$.
When $r > 1$, there are three cases: If step $r$ is a doubling, then $\frac12 n = a_{r-1} \le
2^{d-2} F_{f+3}$; hence (14) follows. If steps $r$ and $r - 1$ are both nondoublings, then
$a_{r-1} \le 2^{d-1} F_{f+2}$ and $a_{r-2} \le 2^{d-1} F_{f+1}$; hence $n = a_r \le a_{r-1} + a_{r-2} \le
2^{d-1}(F_{f+2} + F_{f+1}) = 2^{d-1} F_{f+3}$ by the definition of the Fibonacci sequence.
Finally, if step $r$ is a nondoubling but step $r - 1$ is a doubling, then $a_{r-2} \le
2^{d-2} F_{f+2}$ and $n = a_r \le a_{r-1} + a_{r-2} = 3a_{r-2}$. Now $2F_{f+3} - 3F_{f+2} =
F_{f+1} - F_f \ge 0$; hence $n \le 2^{d-1} F_{f+3}$ in all cases. ∎

The method of proof we have used shows that inequality (14) is “best
possible” under the stated assumptions; the addition chain

$$1,\ 2,\ \ldots,\ 2^{d-1},\ 2^{d-1}F_3,\ 2^{d-1}F_4,\ \ldots,\ 2^{d-1}F_{f+3} \tag{15}$$

has $d$ doublings and $f$ nondoublings.
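The extremal chain (15) is easy to generate and check. In this sketch the function names are hypothetical and the Fibonacci numbers are indexed with $F_0 = 0$, $F_1 = 1$, so that $F_3 = 2$.

```python
def fib(k):
    """Fibonacci number F_k, with F_0 = 0 and F_1 = 1."""
    a, b = 0, 1
    for _ in range(k):
        a, b = b, a + b
    return a


def extremal_chain(d, f):
    """Chain (15): d doublings followed by f Fibonacci-like nondoublings (d >= 1)."""
    chain = [1 << i for i in range(d)]                     # 1, 2, ..., 2^(d-1)
    chain += [(1 << (d - 1)) * fib(j) for j in range(3, f + 4)]
    return chain


def count_doublings(chain):
    """Number of steps i with a_i = 2 * a_(i-1)."""
    return sum(chain[i] == 2 * chain[i - 1] for i in range(1, len(chain)))
```

For $d = 3$, $f = 2$ this yields 1, 2, 4, 8, 12, 20, whose last term equals $2^2 F_5 = 20$, so (14) holds with equality.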

Corollary. If the addition chain (11) includes $f$ nondoublings and $s$ small steps,
then

$$s \le f \le 3.271\,s. \tag{16}$$

Proof. Obviously $s \le f$. We have $2^{\lambda(n)} \le n \le 2^{d-1}F_{f+3} \le 2^d\phi^f =
2^{\lambda(n)+s}(\phi/2)^f$, since $d + f = \lambda(n) + s$, and since $F_{f+3} \le 2\phi^f$ when $f \ge 0$.
Hence $0 \le s\ln 2 + f\ln(\phi/2)$, and (16) follows from the fact that

$$\ln 2/\ln(2/\phi) \approx 3.2706. \quad ∎$$

Values of $l(n)$ for special $n$. It is easy to show by induction that $a_i \le 2^i$, and
therefore $\lg n \le r$ in any addition chain (11). Hence

$$l(n) \ge \lceil \lg n \rceil. \tag{17}$$

This lower bound, together with the upper bound (10) given by the binary
method, gives us the values

$$l(2^A) = A; \tag{18}$$

$$l(2^A + 2^B) = A + 1, \quad\text{if } A > B. \tag{19}$$

In other words, the binary method is optimum when $\nu(n) \le 2$. With some
further calculation we can extend these formulas to the case $\nu(n) = 3$:

Theorem B.

$$l(2^A + 2^B + 2^C) = A + 2, \quad\text{if } A > B > C. \tag{20}$$

Proof. We can, in fact, prove a stronger result that will be of use to us later
in this section: All addition chains with exactly one small step have one of the
following six types (where all steps indicated by “...” represent doublings):

Type 1. $1, \ldots, 2^A,\ 2^A + 2^B, \ldots, 2^{A+C} + 2^{B+C}$; $A > B \ge 0$, $C \ge 0$.

Type 2. $1, \ldots, 2^A,\ 2^A + 2^B,\ 2^{A+1} + 2^B, \ldots, 2^{A+C+1} + 2^{B+C}$; $A > B \ge 0$,
$C \ge 0$.

Type 3. $1, \ldots, 2^A,\ 2^A + 2^{A-1},\ 2^{A+1} + 2^{A-1},\ 2^{A+2}, \ldots, 2^{A+C}$; $A > 0$,
$C \ge 2$.

Type 4. $1, \ldots, 2^A,\ 2^A + 2^{A-1},\ 2^{A+1} + 2^A,\ 2^{A+2}, \ldots, 2^{A+C}$; $A > 0$, $C \ge 2$.

Type 5. $1, \ldots, 2^A,\ 2^A + 2^{A-1}, \ldots, 2^{A+C} + 2^{A+C-1},\ 2^{A+C+1} + 2^{A+C-2},
\ldots, 2^{A+C+D+1} + 2^{A+C+D-2}$; $A > 0$, $C > 0$, $D \ge 0$.

Type 6. $1, \ldots, 2^A,\ 2^A + 2^B,\ 2^{A+1}, \ldots, 2^{A+C}$; $A > B \ge 0$, $C \ge 1$.

A straightforward hand calculation shows that these six types exhaust all
possibilities. (Note that, by the corollary to Theorem A, there are at most three
nondoublings when there is one small step; this maximum of three is attained
only in sequences of Type 3. All of the above are star chains, except Type 6
when $B < A - 1$.)

The theorem now follows from the observation that $l(2^A + 2^B + 2^C) \le A + 2$;
and $l(2^A + 2^B + 2^C)$ must be greater than $A + 1$, since none of the six possible
types have $\nu(n) > 2$. ∎

(E. de Jonquières stated without proof in 1894 that $l(n) \ge \lambda(n) + 2$ when
$\nu(n) > 2$. The first published demonstration of Theorem B was by A. A. Gioia,
M. V. Subbarao, and M. Sugunamma in Duke Math. J. 29 (1962), 481-487.)

The calculation of $l(2^A + 2^B + 2^C + 2^D)$, when $A > B > C > D$, is
more involved; by the binary method it is at most $A + 3$, and by the proof of
Theorem B it is at least $A + 2$. The value $A + 2$ is possible, since we know
that the binary method is not optimal when $n = 15$ or $n = 23$. The complete
behavior when $\nu(n) = 4$ can be determined, as we shall now see.

Theorem C. If $\nu(n) \ge 4$ then $l(n) \ge \lambda(n) + 3$, except in the following circumstances
when $A > B > C > D$ and $l(2^A + 2^B + 2^C + 2^D)$ equals $A + 2$:

Case 1. $A - B = C - D$. (Example: $n = 15$.)

Case 2. $A - B = C - D + 1$. (Example: $n = 23$.)

Case 3. $A - B = 3$, $C - D = 1$. (Example: $n = 39$.)

Case 4. $A - B = 5$, $B - C = C - D = 1$. (Example: $n = 135$.)
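The four exceptional cases depend only on the positions of the 1 bits of $n$, so they can be tested mechanically. The following sketch uses a hypothetical function name and does nothing more than transcribe the conditions of the theorem.

```python
def is_special(n):
    """True if n = 2^A + 2^B + 2^C + 2^D falls under Case 1, 2, 3, or 4."""
    exps = [i for i in range(n.bit_length()) if (n >> i) & 1]
    if len(exps) != 4:
        return False
    D, C, B, A = exps                       # exponents in increasing order
    return (A - B == C - D                                    # Case 1
            or A - B == C - D + 1                             # Case 2
            or (A - B == 3 and C - D == 1)                    # Case 3
            or (A - B == 5 and B - C == 1 and C - D == 1))    # Case 4
```

According to the theorem, a number $n$ with $\nu(n) = 4$ satisfies $l(n) = \lambda(n) + 2$ exactly when `is_special(n)` is true; for example 29 = 16 + 8 + 4 + 1 is not special, so $l(29) \ge 7$.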

Proof. When $l(n) = \lambda(n) + 2$, there is an addition chain for $n$ having just two
small steps; such an addition chain starts out as one of the six types in the proof
of Theorem B, followed by a small step, followed by a sequence of nonsmall steps.
Let us say that $n$ is “special” if $n = 2^A + 2^B + 2^C + 2^D$ for one of the four
cases listed in the theorem. We can obtain addition chains of the required form
for each special $n$, as shown in exercise 13; therefore it remains for us to prove
that no chain with exactly two small steps contains any elements with $\nu(a_i) \ge 4$
except when $a_i$ is special.

Let a “counterexample chain” be an addition chain with two small steps
such that $\nu(a_r) \ge 4$, but $a_r$ is not special. If counterexample chains exist, let
$1 = a_0 < a_1 < \cdots < a_r = n$ be a counterexample chain of shortest possible
length. Then step $r$ is not a small step, since none of the six types in the proof
of Theorem B can be followed by a small step with $\nu(n) \ge 4$ except when $n$ is
special. Furthermore, step $r$ is not a doubling, otherwise $a_0, \ldots, a_{r-1}$ would
be a shorter counterexample chain; and step $r$ is a star step, otherwise $a_0, \ldots,
a_{r-2}, a_r$ would be a shorter counterexample chain. Thus

$$a_r = a_{r-1} + a_{r-k}, \quad k \ge 2; \quad\text{and}\quad \lambda(a_r) = \lambda(a_{r-1}) + 1. \tag{21}$$

Let $c$ be the number of carries that occur when $a_{r-1}$ is added to $a_{r-k}$ in
the binary number system by Algorithm 4.3.1A. Using the fundamental relation

$$\nu(a_r) = \nu(a_{r-1}) + \nu(a_{r-k}) - c, \tag{22}$$

we can prove that step $r - 1$ is not a small step (see exercise 14).

Let $m = \lambda(a_{r-1})$. Since neither $r$ nor $r - 1$ is a small step, $c \ge 2$; and
$c = 2$ can hold only when $a_{r-1} \ge 2^m + 2^{m-1}$.

Now let us suppose that $r - 1$ is not a star step. Then $r - 2$ is a small step,
and $a_0, \ldots, a_{r-3}, a_{r-1}$ is a chain with only one small step; hence $\nu(a_{r-1}) \le 2$
and $\nu(a_{r-2}) \le 4$. The relation (22) can now hold only if $\nu(a_r) = 4$, $\nu(a_{r-1}) = 2$,
$k = 2$, $c = 2$, $\nu(a_{r-2}) = 4$. From $c = 2$ we conclude that $a_{r-1} = 2^m + 2^{m-1}$;
hence $a_0, a_1, \ldots, a_{r-3} = 2^{m-1} + 2^{m-2}$ is an addition chain with only one
small step, and it must be of Type 1, so $a_r$ belongs to Case 3. Thus $r - 1$ is a
star step.

Now assume that $a_{r-1} = 2^t a_{r-k}$ for some $t$. If $\nu(a_{r-1}) \le 3$, then by (22),
$c = 2$, $k = 2$, and we see that $a_r$ must belong to Case 3. On the other hand,
if $\nu(a_{r-1}) = 4$ then $a_{r-1}$ is special, and it is easy to see by considering each
case that $a_r$ also belongs to one of the four cases. (Case 4 arises, for example,
when $a_{r-1} = 90$, $a_{r-k} = 45$; or $a_{r-1} = 120$, $a_{r-k} = 15$.) Therefore we may
conclude that $a_{r-1} \ne 2^t a_{r-k}$ for any $t$.

We have proved that $a_{r-1} = a_{r-2} + a_{r-q}$ for some $q \ge 2$. If $k = 2$, then
$q > 2$, and $a_0, a_1, \ldots, a_{r-2}, 2a_{r-2}, 2a_{r-2} + a_{r-q} = a_r$ is a counterexample
sequence in which $k > 2$; therefore we may assume that $k > 2$.

Let us now suppose that $\lambda(a_{r-k}) = m - 1$; the case $\lambda(a_{r-k}) < m - 1$
may be ruled out by similar arguments, as shown in exercise 14. If $k = 4$, both
$r - 2$ and $r - 3$ are small steps; hence $a_{r-4} = 2^{m-1}$, and (22) is impossible.
Therefore $k = 3$; step $r - 2$ is small, $\nu(a_{r-3}) = 2$, $c = 2$, $a_{r-1} \ge 2^m + 2^{m-1}$,
and $\nu(a_{r-1}) = 4$. There must be at least two carries when $a_{r-2}$ is added
to $a_{r-1} - a_{r-2}$; hence $\nu(a_{r-2}) = 4$, and $a_{r-2}$ (being special and $\ge \frac12 a_{r-1}$)
has the form $2^{m-1} + 2^{m-2} + 2^{d+1} + 2^d$ for some $d$. Now $a_{r-1}$ is either
$2^m + 2^{m-1} + 2^{d+1} + 2^d$ or $2^m + 2^{m-1} + 2^{d+2} + 2^{d+1}$, and in both cases $a_{r-3}$
must be $2^{m-1} + 2^{m-2}$, so $a_r$ belongs to Case 3. ∎

E. G. Thurber [Pacific J. Math. 49 (1973), 229-242] has extended Theorem C
to show that $l(n) \ge \lambda(n) + 4$ when $\nu(n) > 8$. It seems reasonable to conjecture
that $l(n) \ge \lambda(n) + \lg \nu(n)$ in general, since A. Schönhage has come very close to
proving this (see exercise 29).

Asymptotic values. Theorem C indicates that it is probably quite difficult to get
exact values of $l(n)$ for large $n$, when $\nu(n) > 4$; however, we can determine the
approximate behavior in the limit as $n \to \infty$.

Theorem D (A. Brauer, Bull. Amer. Math. Soc. 45 (1939), 736-739).

$$\lim_{n\to\infty} l^*(n)/\lambda(n) = \lim_{n\to\infty} l(n)/\lambda(n) = 1. \tag{23}$$

Proof. The addition chain (4) for the $2^k$-ary method is a star chain if we delete
the second occurrence of any element that appears twice in the chain; for if $a_i$
is the first element among $2d_0, 4d_0, \ldots$ of the second line that is not present in
the first line, we have $a_i \le 2(m - 1)$; hence $a_i = (m - 1) + a_j$ for some $a_j$ in
the first line. By totaling up the length of the chain, we have

$$\lambda(n) \le l(n) \le l^*(n) \le \Bigl(1 + \frac{1}{k}\Bigr)\lg n + 2^k \tag{24}$$

for all $k \ge 1$. The theorem follows if we choose, say, $k = \lfloor \frac12 \lg \lambda(n) \rfloor$. ∎
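A small numerical experiment shows how slowly the upper bound in (24) approaches $\lambda(n)$. The sketch below assumes the form $(1 + 1/k)\lg n + 2^k$ for the right-hand side and the choice $k = \lfloor \frac12 \lg \lambda(n) \rfloor$ from the proof; the function names are hypothetical.

```python
from math import floor, log2


def brauer_upper_bound(n, k):
    """Right-hand side of (24): (1 + 1/k) lg n + 2**k."""
    return (1 + 1 / k) * log2(n) + 2 ** k


def ratio_to_lambda(n):
    """Upper bound of (24) divided by lambda(n), with k chosen as in the proof."""
    lam = n.bit_length() - 1                 # lambda(n) = floor(lg n); needs n >= 2
    k = max(1, floor(0.5 * log2(lam)))
    return brauer_upper_bound(n, k) / lam
```

For $n = 2^{256}$, $2^{4096}$, and $2^{65536}$ the ratio is about 1.31, 1.18, and 1.13, decreasing toward the limit 1 asserted in (23).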

If we let $k = \lambda\lambda(n) - 2\lambda\lambda\lambda(n)$ in (24) for large $n$, where $\lambda\lambda(n)$ denotes
$\lambda(\lambda(n))$, we obtain the stronger asymptotic bound

$$l(n) \le l^*(n) \le \lambda(n) + \lambda(n)/\lambda\lambda(n) + O\bigl(\lambda(n)\,\lambda\lambda\lambda(n)/\lambda\lambda(n)^2\bigr). \tag{25}$$

The second term $\lambda(n)/\lambda\lambda(n)$ is essentially the best that can be obtained from
(24). A much deeper analysis of lower bounds can be carried out, to show that
this term $\lambda(n)/\lambda\lambda(n)$ is, in fact, essential in (25). In order to see why this is so,
let us consider the following fact:

Theorem E (Paul Erdős, Acta Arithmetica 6 (1960), 77-81). Let $\epsilon$ be a positive
real number. The number of addition chains (11) such that

    $\lambda(n) = m, \qquad r \le m + (1 - \epsilon)m/\lambda(m)$    (26)

is less than $\alpha^m$, for some $\alpha < 2$, for all suitably large $m$. (In other words, the
number of addition chains so short that (26) is satisfied is substantially less than
the number of values of $n$ such that $\lambda(n) = m$, when $m$ is large.)

Proof. We want to estimate the number of possible addition chains, and for this
purpose our first goal is to get an improvement of Theorem A that enables us to
deal more satisfactorily with nondoublings.

Lemma P. Let $\delta < \sqrt2 - 1$ be a fixed positive real number. Call step $i$ of an
addition chain a "ministep" if it is not a doubling and if $a_i < a_j(1 + \delta)^{i-j}$ for
some $j$, where $0 \le j < i$. If the addition chain contains $s$ small steps and $t$
ministeps, then

    $t \le s/(1 - \theta)$, where $(1 + \delta)^2 = 2^\theta$.    (27)

Proof. For each ministep $i_k$, $1 \le k \le t$, we have $a_{i_k} < a_{j_k}(1 + \delta)^{i_k - j_k}$ for some
$j_k < i_k$. Let $I_1$, ..., $I_t$ be the intervals $(j_1, i_1]$, ..., $(j_t, i_t]$, where the notation
$(j, i]$ stands for the set of all integers $k$ such that $j < k \le i$. It is possible (see
exercise 17) to find nonoverlapping intervals $J_1, \ldots, J_h = (j'_1, i'_1], \ldots, (j'_h, i'_h]$
such that

    $I_1 \cup \cdots \cup I_t = J_1 \cup \cdots \cup J_h$,
    $a_{i'_k} < a_{j'_k}(1 + \delta)^{2(i'_k - j'_k)}$, for $1 \le k \le h$.    (28)

Now for all steps $i$ outside of the intervals $J_1$, ..., $J_h$ we have $a_i \le 2a_{i-1}$; hence
if we let

    $q = (i'_1 - j'_1) + \cdots + (i'_h - j'_h)$,

we have

    $2^{\lambda(n)} \le n \le 2^{r-q}(1 + \delta)^{2q} = 2^{\lambda(n)+s-(1-\theta)q} \le 2^{\lambda(n)+s-(1-\theta)t}$. ∎

Returning to the proof of Theorem E, let us choose $\delta = 2^{\epsilon/4} - 1$, and let
us divide the $r$ steps of each addition chain into three classes:

    $t$ ministeps, $u$ doublings, $v$ other steps, $t + u + v = r$.    (29)

Counting another way, we have $s$ small steps, where $s + m = r$. By the hypotheses,
Theorem A, and Lemma P, we obtain the relations

    $t \le s/(1 - \epsilon/2), \qquad t + v \le 3.271s, \qquad s \le (1 - \epsilon)m/\lambda(m)$.    (30)

Given $s$, $t$, $u$, $v$ satisfying these conditions, there are at most

    $\binom{r}{t+v}\binom{t+v}{v}$    (31)

ways to assign the steps to the specified classes. Given such a distribution of
the steps, let us consider how the non-ministeps can be selected: If step $i$ is
one of the "other" steps in (29), $a_i \ge (1 + \delta)a_{i-1}$, so $a_i = a_j + a_k$, where
$\delta a_{i-1} \le a_k \le a_j \le a_{i-1}$. Also $a_j \le a_i/(1 + \delta)^{i-j} \le 2a_{i-1}/(1 + \delta)^{i-j}$, so
$\delta \le 2/(1 + \delta)^{i-j}$. This gives at most $\beta$ choices for $j$, where $\beta$ is a constant
that depends only on $\delta$. There are also at most $\beta$ choices for $k$, so the number
of ways to assign $j$ and $k$ for each of the non-ministeps is at most

    $\beta^{2v}$.    (32)

Finally, once the "$j$" and "$k$" have been selected for each of the non-ministeps,
there are fewer than

    $\binom{r^2}{t}$    (33)

ways to choose the $j$ and the $k$ for the ministeps: We select $t$ distinct pairs
$(j_1, k_1)$, ..., $(j_t, k_t)$ of indices in the range $0 \le k_h \le j_h < r$, in fewer than (33)
ways. Then for each ministep $i$, in turn, we use a pair of indices $(j_h, k_h)$ such
that

a) $j_h < i$;

b) $a_{j_h} + a_{k_h}$ is as small as possible among the pairs not already used for smaller
ministeps $i$;

c) $a_i = a_{j_h} + a_{k_h}$ satisfies the definition of ministep.

If no such pair $(j_h, k_h)$ exists, we get no addition chain; on the other hand, any
addition chain with ministeps in the designated places must be selected in one
of these ways, so (33) is an upper bound on the possibilities.

Thus the total number of possible addition chains satisfying (26) is bounded
by (31) times (32) times (33), summed over all relevant $s$, $t$, $u$, and $v$. The proof
of Theorem E can now be completed by means of a rather standard estimation
of these functions (exercise 18). ∎

Corollary. The value of $l(n)$ is asymptotically $\lambda(n) + \lambda(n)/\lambda\lambda(n)$, for almost
all $n$. More precisely, there is a function $f(n)$ such that $f(n) \to 0$ as $n \to \infty$,
and

    $\Pr\bigl(\,|l(n) - \lambda(n) - \lambda(n)/\lambda\lambda(n)| \ge f(n)\lambda(n)/\lambda\lambda(n)\,\bigr) = 0$.    (34)

(See Section 3.5 for the definition of this probability "Pr".)

Proof. The upper bound (25) shows that (34) holds without the absolute value
signs. The lower bound comes from Theorem E, if we let $f(n)$ decrease to zero
slowly enough so that, when $f(n) < \epsilon$, the value $N$ is so large that at most $\epsilon N$
values $n \le N$ have $l(n) \le \lambda(n) + (1 - \epsilon)\lambda(n)/\lambda\lambda(n)$. ∎

Star chains. Optimistic people find it reasonable to suppose that $l(n) = l^*(n)$;
given an addition chain of minimal length $l(n)$, it appears hard to believe that
we cannot find one of the same length that satisfies the (apparently mild) star
condition. But in 1958 Walter Hansen proved the remarkable theorem that, for
certain large values of $n$, the value of $l(n)$ is definitely less than $l^*(n)$, and he
also proved several related theorems that we shall now investigate.

Hansen's theorems begin with an investigation of the detailed structure
of a star chain. This structure is given in terms of other sequences and sets
constructed from the given chain. Let $n = 2^{e_0} + 2^{e_1} + \cdots + 2^{e_t}$, $e_0 > e_1 >
\cdots > e_t \ge 0$, and let $1 = a_0 < a_1 < \cdots < a_r = n$ be a star chain for $n$. If
there are $d$ doublings in this chain, we define the auxiliary sequence

    $0 = d_0 \le d_1 \le d_2 \le \cdots \le d_r = d$,    (35)

where $d_i$ is the number of doublings among steps 1, 2, ..., $i$. We also define
a sequence of "multisets" $S_0$, $S_1$, ..., $S_r$, which keep track of the powers of 2
present in the chain. (A multiset is a mathematical entity that is like a set,
but it is allowed to contain repeated elements; an object may be an element
of a multiset several times, and its multiplicity of occurrences is relevant. See
exercise 19 for familiar examples of multisets.) The multisets $S_i$ are defined by
the rules

a) $S_0 = \{0\}$;

b) If $a_{i+1} = 2a_i$, then $S_{i+1} = S_i + 1 = \{x + 1 \mid x \in S_i\}$;

c) If $a_{i+1} = a_i + a_k$, $k < i$, then $S_{i+1} = S_i \uplus S_k$.

(The symbol $\uplus$ means that the multisets are combined, adding the multiplicities.)
From this definition it follows that

    $a_i = \sum_{x \in S_i} 2^x$,    (36)

where the terms in this sum are not necessarily distinct. In particular,

    $n = 2^{e_0} + 2^{e_1} + \cdots + 2^{e_t} = \sum_{x \in S_r} 2^x$.    (37)

The number of elements in the latter sum is at most $2^f$, where $f = r - d$ is the
number of nondoublings.

Since $n$ has two different binary representations in (37), we can partition the
multiset $S_r$ into multisets $M_0$, $M_1$, ..., $M_t$ such that

    $2^{e_j} = \sum_{x \in M_j} 2^x, \qquad 0 \le j \le t$.    (38)

This can be done by arranging the elements of $S_r$ into nondecreasing order $x_1 \le
x_2 \le \cdots$ and taking $M_t = \{x_1, x_2, \ldots, x_k\}$, where $2^{x_1} + \cdots + 2^{x_k} = 2^{e_t}$.
This must be possible, since $e_t$ is the smallest of the $e$'s. Similarly, $M_{t-1} =
\{x_{k+1}, \ldots, x_{k'}\}$, and so on; the process is easily visualized in binary notation.

Let $M_j$ contain $m_j$ elements (counting multiplicities); then $m_j \le 2^f - t$,
since $S_r$ has at most $2^f$ elements and it has been partitioned into $t + 1$ nonempty
multisets. By Eq. (38), we can see that

    $e_j \ge x > e_j - m_j$, for all $x \in M_j$.    (39)

Our examination of the star chain's structure is completed by forming the
multisets $M_{ij}$ that record the ancestral history of $M_j$. The multiset $S_i$ is
partitioned into $t + 1$ multisets as follows:

a) $M_{rj} = M_j$;

b) If $a_{i+1} = 2a_i$, then $M_{ij} = M_{(i+1)j} - 1 = \{x - 1 \mid x \in M_{(i+1)j}\}$;

c) If $a_{i+1} = a_i + a_k$, $k < i$, then (since $S_{i+1} = S_i \uplus S_k$) we let $M_{ij} =
M_{(i+1)j}$ minus $S_k$; that is, we remove the elements of $S_k$ from $M_{(i+1)j}$. If
some element of $S_k$ appears in two or more different multisets $M_{(i+1)j}$, we
remove it from the set with the largest possible value of $j$; this rule uniquely
defines $M_{ij}$ for each $j$, when $i$ is fixed.

From this definition it follows that

    $e_j + d_i - d \ge x > e_j + d_i - d - m_j$, for all $x \in M_{ij}$.    (40)


As an example of this detailed construction, let us consider the star chain
1, 2, 3, 5, 10, 20, 23, for which $t = 3$, $r = 6$, $d = 3$, $f = 3$. We obtain the
following array of multisets (a dash denotes an empty multiset; the entries in
column $i$ of the $M$ rows together make up $S_i$):

                            i=0    i=1    i=2    i=3    i=4    i=5    i=6
(d_0, d_1, ..., d_6)          0      1      1      1      2      3      3
(a_0, a_1, ..., a_6)          1      2      3      5     10     20     23
(M_03, M_13, ..., M_63)       -      -      -      -      -      -    {0}    M_3: e_3 = 0, m_3 = 1
(M_02, M_12, ..., M_62)       -      -      -      -      -      -    {1}    M_2: e_2 = 1, m_2 = 1
(M_01, M_11, ..., M_61)       -      -    {0}    {0}    {1}    {2}    {2}    M_1: e_1 = 2, m_1 = 1
(M_00, M_10, ..., M_60)     {0}    {1}    {1}  {1,1}  {2,2}  {3,3}  {3,3}    M_0: e_0 = 4, m_0 = 2
                            S_0    S_1    S_2    S_3    S_4    S_5    S_6

Thus $M_{40} = \{2, 2\}$, etc. From the construction we can see that $d_i$ is the largest
element of $S_i$; hence

    $d_i \in M_{i0}$.    (41)
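
The construction of $d_i$, $S_i$, and $M_{ij}$ is mechanical enough to be checked by a short program. The sketch below is an illustration only: the function name `hansen_structure` and the use of Python's `Counter` as a multiset are choices made here, not part of the text, and the input is assumed to be an ascending star chain.

```python
from collections import Counter

def hansen_structure(chain):
    """Return (d, S, M) for an ascending star chain; M[i][j] is the multiset M_ij."""
    r, n = len(chain) - 1, chain[-1]
    # Exponents e_0 > e_1 > ... > e_t of the binary representation of n.
    e = [b for b in range(n.bit_length() - 1, -1, -1) if (n >> b) & 1]
    t = len(e) - 1
    d, S, other = [0], [Counter({0: 1})], [None]   # other[i] = k for a nondoubling step i
    for i in range(r):
        if chain[i + 1] == 2 * chain[i]:           # rule (b): doubling
            d.append(d[-1] + 1)
            S.append(Counter({x + 1: c for x, c in S[i].items()}))
            other.append(None)
        else:                                      # rule (c): a_{i+1} = a_i + a_k
            k = chain.index(chain[i + 1] - chain[i])
            d.append(d[-1])
            S.append(S[i] + S[k])
            other.append(k)
    # Partition S_r into M_t, ..., M_0 by taking the smallest elements first, as in (38).
    elems, pos, final = sorted(S[r].elements()), 0, [None] * (t + 1)
    for j in range(t, -1, -1):
        total, block = 0, Counter()
        while total < 2 ** e[j]:
            total += 2 ** elems[pos]
            block[elems[pos]] += 1
            pos += 1
        final[j] = block
    # Work backward from M_rj = M_j to obtain the ancestral multisets M_ij.
    M = [None] * (r + 1)
    M[r] = final
    for i in range(r - 1, -1, -1):
        cur = [Counter(m) for m in M[i + 1]]
        if other[i + 1] is None:                   # undo a doubling
            cur = [Counter({x - 1: c for x, c in m.items()}) for m in cur]
        else:                                      # remove S_k, preferring the largest j
            for x in S[other[i + 1]].elements():
                j = max(j for j in range(t + 1) if cur[j][x] > 0)
                cur[j][x] -= 1
            cur = [+m for m in cur]                # unary plus drops zero counts
        M[i] = cur
    return d, S, M

d, S, M = hansen_structure([1, 2, 3, 5, 10, 20, 23])
print(d)                        # the sequence d_0, ..., d_6
print(sorted(M[4][0].elements()))   # the elements of M_40
```

Applied to the chain 1, 2, 3, 5, 10, 20, 23, the function can be compared entry by entry with the array of multisets given in the text.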

The most important part of this structure comes from Eq. (40); one of its 
immediate consequences is 

Lemma K. If $M_{ij}$ and $M_{uv}$ both contain a common integer $x$, then

    $-m_v < (e_j - e_v) - (d_u - d_i) < m_j$. ∎    (42)
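
For readers who want the intermediate step, the lemma follows by applying (40) twice to the common element $x$. The display below is a sketch of that two-line derivation, using only the symbols already defined.

```latex
\[
  e_j + d_i - d \;\ge\; x \;>\; e_v + d_u - d - m_v
  \quad\Longrightarrow\quad
  (e_j - e_v) - (d_u - d_i) > -m_v ,
\]
\[
  e_v + d_u - d \;\ge\; x \;>\; e_j + d_i - d - m_j
  \quad\Longrightarrow\quad
  (e_j - e_v) - (d_u - d_i) < m_j .
\]
```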


Although Lemma K may not look extremely powerful, it says (when $M_{ij}$
contains an element in common with $M_{uv}$ and when $m_j$, $m_v$ are reasonably
small) that the number of doublings between steps $u$ and $i$ is approximately
equal to the difference between the exponents $e_v$ and $e_j$. This imposes a certain
amount of regularity on the addition chain; and it suggests that we might be able
to prove a result analogous to Theorem B above, that $l^*(n) = e_0 + t$, provided
that the $e_j$ are far enough apart. The next theorem shows how this can be done.

Theorem H (W. Hansen, J. für die reine und angew. Math. 202 (1959), 129-136).
Let $n = 2^{e_0} + 2^{e_1} + \cdots + 2^{e_t}$, $e_0 > e_1 > \cdots > e_t \ge 0$. If

    $e_0 > 2e_1 + 2.271(t - 1)$ and $e_{i-1} \ge e_i + 2m$ for $1 \le i \le t$,    (43)

where $m = 2^{\lfloor 3.271(t-1) \rfloor} - t$, then $l^*(n) = e_0 + t$.

Proof. We may assume that $t > 2$, since the result of the theorem is true without
restriction on the $e$'s when $t \le 2$. Suppose that we have a star chain $1 = a_0 <
a_1 < \cdots < a_r = n$ for $n$ with $r \le e_0 + t - 1$. Let the integers $d$, $f$, $d_0$, ..., $d_r$,
and the multisets $M_{ij}$ and $S_i$, reflect the structure of this chain, as defined above.
By the corollary to Theorem A, we know that $f \le \lfloor 3.271(t - 1) \rfloor$; therefore the
value of $m$ is a bona fide upper bound for the number $m_j$ of elements in each
multiset $M_j$.

In the summation

    $a_i = \bigl(\sum_{x \in M_{i0}} 2^x\bigr) + \bigl(\sum_{x \in M_{i1}} 2^x\bigr) + \cdots + \bigl(\sum_{x \in M_{it}} 2^x\bigr)$,

no carries propagate from the term corresponding to $M_{ij}$ to the term corresponding
to $M_{i(j-1)}$, if we think of this sum as being carried out in the binary number
system, since the $e$'s are so far apart. (Cf. (40).) In particular, the sum of all
the terms for $j \ne 0$ will not carry up to affect the terms for $j = 0$, so we must
have

    $a_i \ge \sum_{x \in M_{i0}} 2^x \ge 2^{\lambda(a_i)}, \qquad 0 \le i \le r$.    (44)

In order to prove Theorem H, we would like to show that in some sense the
$t$ extra powers of $n$ must be put in "one at a time," so we want to find a way to
tell at which step each of these terms essentially enters the addition chain.

Let $j$ be a number between 1 and $t$. Since $M_{0j}$ is empty and $M_{rj} = M_j$ is
nonempty, we can find the first step $i$ for which $M_{ij}$ is not empty.

From the way in which the $M_{ij}$ are defined, we know that step $i$ is a nondoubling:
$a_i = a_{i-1} + a_u$ for some $u < i - 1$. We also know that all the
elements of $M_{ij}$ are elements of $S_u$. We will prove that $a_u$ must be relatively
small compared to $a_i$.

Let $x_j$ be an element of $M_{ij}$. Then since $x_j \in S_u$, there is some $v$ for which
$x_j \in M_{uv}$. It follows that

    $d_i - d_u > m$,    (45)

i.e., at least $m + 1$ doublings occur between steps $u$ and $i$. For if $d_i - d_u \le m$,
Lemma K tells us that $|e_j - e_v| < 2m$; hence $v = j$. But this is impossible,
because $M_{uj}$ is empty by our choice of step $i$.

All elements of $S_u$ are less than or equal to $e_1 + d_i - d$. For if $x \in S_u \subseteq S_i$
and $x > e_1 + d_i - d$, then $x \in M_{u0}$ and $x \in M_{i0}$ by (40); so Lemma K implies
that $|d_i - d_u| < m$, contradicting (45). In fact, this argument proves that $M_{i0}$
has no elements in common with $S_u$, so $M_{(i-1)0} = M_{i0}$. From (44) we have
$a_{i-1} \ge 2^{\lambda(a_i)}$, and therefore step $i$ is a small step.

We can now deduce what is probably the key fact in this entire proof: All
elements of $S_u$ are in $M_{u0}$. For if not, let $x$ be an element of $S_u$ with $x \notin M_{u0}$.
Since $x \ge 0$, (40) implies that $e_1 \ge d - d_u$, hence

    $e_0 = f + d - s \le 2.271s + d \le 2.271(t - 1) + e_1 + d_u$.

By hypothesis (43), this implies $d_u > e_1$. But $d_u \in S_u$ by (41), and it cannot be
in $M_{i0}$, hence $d_u \le e_1 + d_i - d \le e_1$, a contradiction.

Going back to our element $x_j$ in $M_{ij}$, we have $x_j \in M_{uv}$; and we have
proved that $v = 0$. Therefore, by equation (40) again,

    $e_0 + d_u - d \ge x_j > e_0 + d_u - d - m_0$.    (46)

For all $j = 1$, 2, ..., $t$ we have determined a number $x_j$ satisfying (46),
and a small step $i$ at which the term $2^{e_j}$ may be said to have entered into the
addition chain. If $j \ne j'$, the step $i$ at which this occurs cannot be the same
for both $j$ and $j'$; for (46) would tell us that $|x_j - x_{j'}| < m$, while elements of
$M_{ij}$ and $M_{ij'}$ must differ by more than $m$, since $e_j$ and $e_{j'}$ are so far apart. We
are forced to conclude that the chain contains at least $t$ small steps; but this is
a contradiction. ∎

Theorem F (W. Hansen).

    $l(2^A + xy) \le A + \nu(x) + \nu(y) - 1$, if $\lambda(x) + \lambda(y) \le A$.    (47)

Proof. An addition chain (which is not a star chain in general) may be constructed
by combining the binary and factor methods. Let $x = 2^{x_1} + \cdots + 2^{x_u}$
and $y = 2^{y_1} + \cdots + 2^{y_v}$, where $x_1 > \cdots > x_u \ge 0$ and $y_1 > \cdots > y_v \ge 0$.

The first steps of the chain form successive powers of 2, until $2^{A-y_1}$ is
reached; in between these steps, the additional values $2^{x_{u-1}} + 2^{x_u}$, $2^{x_{u-2}} +
2^{x_{u-1}} + 2^{x_u}$, ..., and $x$ are inserted in the appropriate places. After a chain up
to $2^{A-y_i} + x(2^{y_1-y_i} + \cdots + 2^{y_{i-1}-y_i})$ has been formed, we continue by adding $x$
and doubling the resulting sum $y_i - y_{i+1}$ times; this yields

    $2^{A-y_{i+1}} + x(2^{y_1-y_{i+1}} + \cdots + 2^{y_i-y_{i+1}})$.

If this construction is done for $i = 1$, 2, ..., $v$, assuming for convenience that
$y_{v+1} = 0$, we have an addition chain for $2^A + xy$ as desired. ∎
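
The construction in this proof can be followed step by step in code. The sketch below is illustrative: the names `hansen_chain`, `bits_desc`, and `is_addition_chain` are introduced here, and the chain is sorted at the end, which is safe because every element other than 1 is the sum of two smaller elements of the same list.

```python
def bits_desc(n):
    """Exponents of the 1-bits of n, largest first."""
    return [e for e in range(n.bit_length() - 1, -1, -1) if (n >> e) & 1]

def hansen_chain(A, x, y):
    """Addition chain for 2^A + x*y built as in the proof of Theorem F."""
    xs, ys = bits_desc(x), bits_desc(y)
    if xs[0] + ys[0] > A:
        raise ValueError("need lambda(x) + lambda(y) <= A")
    # Successive powers of 2 up to 2^(A - y_1).
    chain = [1 << e for e in range(A - ys[0] + 1)]
    # Partial sums of the bits of x, from the low end upward, ending with x itself.
    partial = 1 << xs[-1]
    for e in reversed(xs[:-1]):
        partial += 1 << e
        chain.append(partial)
    # For i = 1..v: add x, then double (y_i - y_{i+1}) times, with y_{v+1} = 0.
    cur = 1 << (A - ys[0])
    for i, yi in enumerate(ys):
        cur += x
        chain.append(cur)
        y_next = ys[i + 1] if i + 1 < len(ys) else 0
        for _ in range(yi - y_next):
            cur *= 2
            chain.append(cur)
    return sorted(chain)

def is_addition_chain(chain):
    """True if chain starts with 1 and every later element is a sum of two earlier ones."""
    if not chain or chain[0] != 1:
        return False
    seen = {1}
    for a in chain[1:]:
        if not any((a - b) in seen for b in seen):
            return False
        seen.add(a)
    return True

chain = hansen_chain(14, 9, 17)       # 2^14 + 9*17 = 16537
print(chain[-1], len(chain) - 1)      # target value and number of steps
```

The number of steps is $(A - y_1) + (\nu(x) - 1) + \nu(y) + y_1$, which is the bound in (47).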

Theorem F enables us to find values of $n$ for which $l(n) < l^*(n)$, since
Theorem H gives an explicit value of $l^*(n)$ in certain cases. For example, let
$x = 2^{1016} + 1$, $y = 2^{2032} + 1$, and let

    $n = 2^{6103} + xy = 2^{6103} + 2^{3048} + 2^{2032} + 2^{1016} + 1$.

According to Theorem F, we have $l(n) \le 6106$. But Theorem H also applies,
with $m = 508$, and this proves that $l^*(n) = 6107$.

Extensive computer calculations have shown that $n = 12509$ is the smallest
value with $l(n) < l^*(n)$. No star chain for this value of $n$ is as short as the
sequence 1, 2, 4, 8, 16, 17, 32, 64, 128, 256, 512, 1024, 1041, 2082, 4164, 8328,
8345, 12509. The brute-force methods in the proof of Theorem C could be
extended by computer program to determine all $n$ such that $l(n) = \lambda(n) + 3$;
this approach would also characterize all $n$ with $\nu(n) = 5$ and $l(n) \ne l^*(n)$.
(The smallest such $n$ is $16537 = 2^{14} + 9 \cdot 17$.)

Some conjectures. Although it was reasonable to guess at first glance that $l(n) =
l^*(n)$, we have now seen that this is false. Another plausible conjecture [first
made by A. Goulard, and supposedly "proved" by E. de Jonquières in l'Interm.
des Math. 2 (1895), 125-126] is that $l(2n) = l(n) + 1$; a doubling step is so
efficient, it seems unlikely that there could be any shorter chain for $2n$ than
to add a doubling step to the shortest chain for $n$. But computer calculations
show that this conjecture also fails, since $l(191) = l(382) = 11$. (A star chain
of length 11 for 382 is not hard to find; e.g., 1, 2, 4, 5, 9, 14, 23, 46, 92, 184,
198, 382. The number 191 is minimal such that $l(n) = 11$, and it seems to be
very difficult to prove by hand that $l(191) > 10$; the computer's proof of this
fact, using a backtrack method that will be sketched in Section 7.2.2, involved
a detailed examination of 948 cases.) The smallest four values of $n$ such that
$l(2n) = l(n)$ are $n = 191$, 701, 743, 1111; E. G. Thurber proved in Pacific J.
Math. 49 (1973), 229-242, that the third of these is a member of an infinite
family of such $n$, namely $23 \cdot 2^k + 7$ for all $k \ge 5$. It seems reasonable to
conjecture that $l(2n) \ge l(n)$, but even this may be false. Kevin R. Hebb has
shown that $l(n) - l(mn)$ can get arbitrarily large, for all fixed integers $m$ not a
power of 2 [Notices Amer. Math. Soc. 21 (1974), A-294]. The smallest case in
which $l(mn) < l(n)$ is $l((2^{13} + 1)/3) = 15$.
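
Small values of $l(n)$ can be confirmed by exhaustive search. The following is a minimal sketch using iterative deepening over ascending chains; it is not the backtrack method of Section 7.2.2, and the names `shortest_chain` and `is_addition_chain` are introduced here for illustration. It is practical only for small $n$.

```python
def is_addition_chain(chain):
    """True if chain starts with 1 and every later element is a sum of two earlier ones."""
    if not chain or chain[0] != 1:
        return False
    seen = {1}
    for a in chain[1:]:
        if not any((a - b) in seen for b in seen):
            return False
        seen.add(a)
    return True

def shortest_chain(n):
    """Return one shortest addition chain for n (not necessarily a star chain)."""
    def extend(chain, limit):
        last = chain[-1]
        if last == n:
            return list(chain)
        steps_left = limit - (len(chain) - 1)
        # Each step at most doubles the largest element, so prune hopeless branches.
        if steps_left == 0 or (last << steps_left) < n:
            return None
        candidates = sorted({a + b for a in chain for b in chain
                             if last < a + b <= n}, reverse=True)
        for c in candidates:
            chain.append(c)
            found = extend(chain, limit)
            chain.pop()
            if found:
                return found
        return None

    limit = n.bit_length() - 1        # lambda(n) is a lower bound on l(n)
    while True:
        found = extend([1], limit)
        if found:
            return found
        limit += 1

for n in (7, 15, 23, 29, 47):
    print(n, len(shortest_chain(n)) - 1)
print(is_addition_chain([1, 2, 4, 5, 9, 14, 23, 46, 92, 184, 198, 382]))
```

The last line checks only that the chain for 382 quoted in the text is a valid addition chain of length 11; proving that no chain of length 10 exists for 191 requires the search itself.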

Let $c(r)$ be the smallest value of $n$ such that $l(n) = r$. It seems to be most
difficult to compute $l(n)$ for these values of $n$. We have the following table:

     r   c(r)        r   c(r)        r    c(r)
     1      2        7     29       13     607
     2      3        8     47       14    1087
     3      5        9     71       15    1903
     4      7       10    127       16    3583
     5     11       11    191       17    6271
     6     19       12    379       18   11231


For $r \le 11$, the value of $c(r)$ is approximately equal to $c(r - 1) + c(r - 2)$, and
this fact led to speculation by several people that $c(r)$ grows like the function
$\phi^r$; but the result of Theorem D (with $n = c(r)$) implies that $r/\lg c(r) \to 1$
as $r \to \infty$. [See E. G. Thurber, Duke Math. J. 40 (1973), 907-913, for more
detailed information about the growth of $c(r)$.] Several people had conjectured
at one time that $c(r)$ would always be a prime number; but $c(15) = 11 \cdot 173$ and
$c(18) = 11 \cdot 1021$. Perhaps no conjecture about addition chains is safe!

Tabulated values of $l(n)$ show that this function is surprisingly smooth; for
example, $l(n) = 13$ for all $n$ in the range $1125 \le n \le 1148$. The computer
calculations show that a table of $l(n)$ may be prepared for all $n \le 1000$ by using
the formula

    $l(n) = \min(l(n - 1) + 1,\; l_n) - \delta_n$,    (48)

where $l_n = \infty$ if $n$ is prime, otherwise $l_n = l(p) + l(n/p)$ if $p$ is the smallest prime
dividing $n$; and $\delta_n = 1$ for $n$ in Table 1, $\delta_n = 0$ otherwise.
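
Formula (48) translates directly into a table-building loop. The sketch below is an illustration under stated assumptions: the set `TABLE_1` is transcribed from Table 1 of this section, the names `l_table` and `smallest_prime_factor` are introduced here, and the results are only as reliable as that transcription.

```python
import math

# Values of n from Table 1 (the cases where delta_n = 1 in formula (48)).
TABLE_1 = {
    23, 43, 59, 77, 83, 107, 149, 163, 165, 179, 203, 211, 213, 227,
    229, 233, 281, 283, 293, 311, 317, 319, 323, 347, 349, 355, 359, 367,
    371, 373, 377, 381, 382, 395, 403, 413, 419, 421, 423, 429, 437, 451,
    453, 455, 457, 479, 503, 509, 551, 553, 557, 561, 569, 571, 573, 581,
    599, 611, 619, 623, 631, 637, 643, 645, 659, 667, 669, 677, 683, 691,
    707, 709, 711, 713, 715, 717, 739, 741, 749, 759, 779, 787, 803, 809,
    813, 825, 835, 837, 839, 841, 845, 849, 863, 869, 887, 893, 899, 901,
    903, 905, 923, 941, 947, 955, 983,
}

def smallest_prime_factor(n):
    """Smallest prime dividing n (returns n itself when n is prime)."""
    for p in range(2, math.isqrt(n) + 1):
        if n % p == 0:
            return p
    return n

def l_table(limit=1000):
    """Tabulate l(n) for 1 <= n <= limit by formula (48); index 0 is unused."""
    l = [0, 0]                                   # l(1) = 0
    for n in range(2, limit + 1):
        p = smallest_prime_factor(n)
        l_n = math.inf if p == n else l[p] + l[n // p]
        delta = 1 if n in TABLE_1 else 0
        l.append(min(l[n - 1] + 1, l_n) - delta)
    return l

l = l_table()
print([l[n] for n in range(1, 24)])
```

The formula is stated in the text only for $n \le 1000$, so the default limit should not be raised without further justification.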





Table 1

VALUES OF n FOR SPECIAL ADDITION CHAINS

     23  163  229  319  371  413  453  553  599  645  707  741  813  849  903
     43  165  233  323  373  419  455  557  611  659  709  749  825  863  905
     59  179  281  347  377  421  457  561  619  667  711  759  835  869  923
     77  203  283  349  381  423  479  569  623  669  713  779  837  887  941
     83  211  293  355  382  429  503  571  631  677  715  787  839  893  947
    107  213  311  359  395  437  509  573  637  683  717  803  841  899  955
    149  227  317  367  403  451  551  581  643  691  739  809  845  901  983


Let $d(r)$ be the number of solutions $n$ to the equation $l(n) = r$. The following
table lists the first few values of this function:

     r   d(r)        r   d(r)        r   d(r)
     1      1        6     15       11    246
     2      2        7     26       12    432
     3      3        8     44       13    772
     4      5        9     78       14   1382
     5      9       10    136       15   2481


Surely d(r) must be an increasing function of r, but there is no evident way to 
prove this seemingly simple assertion, much less to determine the asymptotic 
growth of d(r) for large r. 

The most famous problem about addition chains that is still outstanding is
the "Scholz-Brauer conjecture," which states that

    $l(2^n - 1) \le n - 1 + l(n)$.    (49)

Computer calculations show, in fact, that equality holds in (49) for $1 \le n \le 14$;
and hand calculations by E. G. Thurber [Discrete Math. 16 (1976), 279-289]
have shown that equality holds also for $n = 15$, 16, 17, 18, 20, 24, 32. Much
of the research on addition chains has been devoted to attempts to prove (49);
addition chains for the number $2^n - 1$, which has so many ones in its binary
representation, are of special interest, since this is the worst case for the binary
method. Arnold Scholz coined the name "addition chain" (in German) and
posed (49) as a problem in 1937 [Jahresbericht der deutschen Mathematiker-Vereinigung,
class II, 47 (1937), 41-42]; Alfred Brauer proved in 1939 that

    $l^*(2^n - 1) \le n - 1 + l^*(n)$.    (50)

Hansen's theorems show that $l(n)$ can be less than $l^*(n)$, so more work is
definitely necessary in order to prove or disprove (49). As a step in this direction,
Hansen has defined the concept of an $l^0$-chain, which lies "between" $l$-chains and
$l^*$-chains. In an $l^0$-chain, certain of the elements are underlined; the condition
is that $a_i = a_j + a_k$, where $a_j$ is the largest underlined element less than $a_i$.

As an example of an $l^0$-chain (certainly not a minimum one), consider

    1, 2, 4, 5, 8, 10, 12, 18    (51)

with the elements 1, 2, 4, 8, and 18 underlined;
it is easy to verify that the difference between each element and the previous
underlined element is in the chain. We let $l^0(n)$ denote the minimum length of
an $l^0$-chain for $n$. Clearly $l(n) \le l^0(n) \le l^*(n)$.

The chain constructed in Theorem F is an $l^0$-chain (see exercise 22); hence
we have $l^0(n) < l^*(n)$ for certain $n$. It is not known whether or not $l(n) = l^0(n)$
in all cases; if this equation were true, the Scholz-Brauer conjecture would be
settled, because of another theorem due to Hansen:

Theorem G. $l^0(2^n - 1) \le n - 1 + l^0(n)$.

Proof. Let $1 = a_0$, $a_1$, ..., $a_r = n$ be an $l^0$-chain of minimum length for $n$,
and let $1 = b_0$, $b_1$, ..., $b_t = n$ be the subsequence of underlined elements. (We
may assume that $n$ is underlined.) Then we can get an $l^0$-chain for $2^n - 1$ as
follows:

a) Include the $l^0(n) + 1$ numbers $2^{a_i} - 1$, for $0 \le i \le r$, underlined if and only
if $a_i$ is underlined.

b) Include the numbers $2^i(2^{b_j} - 1)$, for $0 \le j < t$ and for $0 < i \le b_{j+1} - b_j$,
all underlined. (This is a total of $b_1 - b_0 + \cdots + b_t - b_{t-1} = n - 1$ numbers.)

c) Sort the numbers from (a) and (b) into ascending order.

We may easily verify that this gives an $l^0$-chain: The numbers of (b) are all
equal to twice some other element of (a) or (b); furthermore, this element is the
preceding underlined element. If $a_i = b_j + a_k$, where $b_j$ is the largest underlined
element less than $a_i$, then $a_k = a_i - b_j \le b_{j+1} - b_j$, so $2^{a_k}(2^{b_j} - 1) = 2^{a_i} - 2^{a_k}$
appears underlined in the chain, just preceding $2^{a_i} - 1$. Since $2^{a_i} - 1$ is equal
to $(2^{a_i} - 2^{a_k}) + (2^{a_k} - 1)$, where both of these values appear in the chain, we
have an addition chain with the $l^0$ property. ∎

The chain corresponding to (51), constructed in the proof of Theorem G, is

    1, 2, 3, 6, 12, 15, 30, 31, 60, 120, 240, 255, 510, 1020, 1023, 2040,
    4080, 4095, 8160, 16320, 32640, 65280, 130560, 261120, 262143,

in which every element except 31, 1023, and 4095 is underlined.
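
Steps (a), (b), and (c) of the proof of Theorem G are easy to carry out mechanically. The sketch below is illustrative: the function names are introduced here, and it takes 1, 2, 4, 8, 18 as the underlined elements of (51), which is an inference from the resulting chain, since underlining does not survive in plain text.

```python
def is_l0_chain(chain, underlined):
    """Check the l^0 condition on an ascending chain: each element after the first is
    (largest underlined element less than it) + (some element already in the chain)."""
    members, last_under = set(), None
    for idx, a in enumerate(chain):
        if idx == 0:
            if a != 1:
                return False
        elif last_under is None or (a - last_under) not in members:
            return False
        members.add(a)
        if a in underlined:
            last_under = a
    return True

def theorem_g_chain(chain, underlined):
    """Build the l^0-chain for 2^n - 1 from an l^0-chain for n (n must be underlined)."""
    b = [a for a in chain if a in underlined]          # underlined subsequence b_0..b_t
    part_a = {2 ** a - 1 for a in chain}               # step (a)
    new_under = {2 ** a - 1 for a in b}
    part_b = set()
    for lo, hi in zip(b, b[1:]):                       # step (b)
        for i in range(1, hi - lo + 1):
            part_b.add(2 ** i * (2 ** lo - 1))
    new_under |= part_b
    return sorted(part_a | part_b), new_under          # step (c)

chain51 = [1, 2, 4, 5, 8, 10, 12, 18]
under51 = {1, 2, 4, 8, 18}
new_chain, new_under = theorem_g_chain(chain51, under51)
print(new_chain)
print(is_l0_chain(new_chain, new_under), len(new_chain) - 1)
```

The length of the new chain is $(n - 1)$ plus the length of the original chain, in agreement with the statement of Theorem G.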


Graphical representation. An addition chain (1) corresponds in a natural way
to a directed graph, where the vertices are labeled $a_i$ for $0 \le i \le r$, and where
we draw arcs from $a_j$ to $a_i$ and from $a_k$ to $a_i$ as a representation of each step
$a_i = a_j + a_k$ in (2). For example, the addition chain 1, 2, 3, 6, 12, 15, 27, 39,
78, 79 that appears in Fig. 14 corresponds to the directed graph

    [Figure: directed graph of this addition chain; not reproduced here.]

If $a_i = a_j + a_k$ for more than one pair of indices $(j, k)$, we choose a definite $j$
and $k$ for purposes of this construction.

In general, all but the first vertex of such a directed graph will be at the
head of exactly two arcs; however, this is not really an important property of
the graph, because it conceals the fact that many different addition chains can
be essentially equivalent. If a vertex has out-degree 1 (i.e., only one arc leading
away), it is used in only one later step, hence the later step is essentially a sum
of three inputs $a_j + a_k + a_m$ that might be computed either as $(a_j + a_k) + a_m$
or as $a_j + (a_k + a_m)$ or $a_k + (a_j + a_m)$. These three choices are immaterial,
but the addition-chain conventions force us to distinguish between them. We
can avoid such redundancy by deleting any vertex whose out-degree is 1 and
attaching the arcs from its predecessors to its successor. For example, the above
graph would become

    [Figure: the reduced directed graph; not reproduced here.]    (52)



We can also delete any vertex whose out-degree is 0, except of course the final 
vertex a r , since such a vertex corresponds to a useless step in the addition chain. 

In this way every addition chain leads to a reduced directed graph that 
contains one “source” vertex (labeled 1) and one “sink” vertex (labeled n); every 
vertex but the source has in-degree $\ge 2$ and every vertex but the sink has
out-degree $\ge 2$. Conversely, any such directed graph without oriented cycles
corresponds to at least one addition chain, since we can topologically sort the 
vertices and write down d — 1 addition steps for each vertex of in-degree d > 0. 
The length of the addition chain, exclusive of useless steps, can be reconstructed 
by looking at the reduced graph; it is 

(number of arcs) — (number of vertices) + 1, (53) 

since deletion of a vertex of out-degree 1 also deletes one arc. 

We say that two addition chains are equivalent if they have the same reduced 
directed graph. For example, the addition chain 1, 2, 3, 6, 12, 15, 24, 39, 40, 79 
is equivalent to the chain we began with, since it also leads to (52). This example 
shows that a non-star chain can be equivalent to a star chain. An addition chain 
is equivalent to a star chain if and only if its reduced directed graph can be 
topologically sorted in only one way. 

An important property of this graph representation has been pointed out 
by N. Pippenger: The label of each vertex is exactly equal to the number of 
oriented paths from the source to that vertex. Thus, the problem of finding an 
optimal addition chain for n is equivalent to minimizing the quantity (53) over 
all directed graphs that have one source vertex and one sink vertex and exactly 
n oriented paths from the source to the sink. 

This characterization has a surprising corollary, because of the symmetry of
the directed graph. If we reverse the direction of all the arcs, the source and the
sink exchange roles, and we obtain another directed graph corresponding to a
set of addition chains for the same $n$; these addition chains have the same length
(53) as the chain we started with. For example, if we make the arrows in (52)
run from right to left, and if we relabel the vertices according to the number of
paths from the right-hand vertex, we get

    [Figure: the reversed and relabeled reduced graph; not reproduced here.]    (54)

One of the star chains corresponding to this reduced directed graph is

    1, 2, 4, 6, 12, 24, 26, 52, 78, 79;

we may call this a dual of the original addition chain.
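
The reduction, the length formula (53), the path-counting property, and the reversal can all be checked on the example chain. The sketch below is illustrative only: arcs are stored as a list of (tail, head) pairs so that parallel arcs are kept, the function names are introduced here, and the chain is assumed to contain no useless steps.

```python
from collections import Counter

def chain_arcs(chain):
    """One definite decomposition a_i = a_j + a_k per step, as arcs (a_j, a_i), (a_k, a_i)."""
    arcs, seen = [], [chain[0]]
    for a in chain[1:]:
        j = next(b for b in reversed(seen) if (a - b) in seen)   # largest usable a_j
        arcs += [(j, a), (a - j, a)]
        seen.append(a)
    return arcs

def reduce_graph(arcs):
    """Repeatedly delete a vertex of out-degree 1, rerouting its incoming arcs."""
    arcs = list(arcs)
    while True:
        out = Counter(u for u, _ in arcs)
        victim = next((u for u, c in out.items() if c == 1), None)
        if victim is None:
            return arcs
        succ = next(v for u, v in arcs if u == victim)
        arcs = [(u, succ if v == victim else v) for u, v in arcs if u != victim]

def path_counts(arcs, start):
    """Number of oriented paths from start to each vertex of an acyclic multigraph."""
    counts, indeg, ready = Counter({start: 1}), Counter(v for _, v in arcs), [start]
    while ready:
        u = ready.pop()
        for a, b in arcs:
            if a == u:
                counts[b] += counts[u]
                indeg[b] -= 1
                if indeg[b] == 0:
                    ready.append(b)
    return dict(counts)

chain = [1, 2, 3, 6, 12, 15, 27, 39, 78, 79]
reduced = reduce_graph(chain_arcs(chain))
vertices = {v for arc in reduced for v in arc}
print(len(reduced) - len(vertices) + 1)                    # formula (53)
print(path_counts(reduced, 1))                             # labels equal path counts
dual = path_counts([(v, u) for u, v in reduced], 79)       # reverse every arc
print(sorted(dual.values()))                               # labels of the dual graph
```

The labels obtained after reversal all occur in the dual chain 1, 2, 4, 6, 12, 24, 26, 52, 78, 79 quoted above; the remaining elements of that chain correspond to vertices that the reduction removes.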

Exercises 39 and 40 discuss important consequences of this graphical
representation and the duality principle.



EXERCISES 

1. [15] What is the value of Z when Algorithm A terminates? 

2. [24] Write a MIX program for Algorithm A, to calculate $x^n \bmod w$ given integers
$n$ and $x$, where $w$ is the word size. Assume that MIX has the binary operations SRB,
JAE, etc., that are described in Section 4.5.2. Write another program that computes
$x^n \bmod w$ in a serial manner (multiplying repeatedly by $x$), and compare the running
times of these programs.

► 3. [22] How is $x^{975}$ calculated by (a) the binary method? (b) the ternary method?
(c) the quaternary method? (d) the factor method?

4. [M20] Find a number $n$ for which the octal ($2^3$-ary) method gives ten fewer
multiplications than the binary method.

► 5. [24] Figure 13 shows the first eight levels of the "power tree." The $(k+1)$-st level
of this tree is defined as follows, assuming that the first $k$ levels have been constructed:
Take each node $n$ of the $k$th level, from left to right in turn, and attach below it the
nodes

    $n + 1,\; n + a_1,\; n + a_2,\; \ldots,\; n + a_{k-1} = 2n$

(in this order), where 1, $a_1$, $a_2$, ..., $a_{k-1}$ is the path from the root of the tree to $n$;
but discard any node that duplicates a number that has already appeared in the tree.

Design an efficient algorithm that constructs the first $r + 1$ levels of the power
tree. [Hint: Make use of two sets of variables LINKU[j], LINKR[j] for $0 \le j \le 2^r$; these
point upwards and to the right, respectively, if $j$ is a number in the tree.]

6. [M26] If a slight change is made to the definition of the power tree that is given
in exercise 5, so that the nodes below $n$ are attached in decreasing order

    $n + a_{k-1},\; \ldots,\; n + a_2,\; n + a_1,\; n + 1$

instead of increasing order, we get a tree whose first five levels are

    [Figure: first five levels of the modified tree; not reproduced here.]

Show that this tree gives a method of computing $x^n$ that requires exactly as many
multiplications as the binary method; therefore it is not as good as the power tree,
although it has been constructed in almost the same way.

7. [M21] Prove that there are infinitely many values of $n$

a) for which the factor method is better than the binary method;

b) for which the binary method is better than the factor method;

c) for which the power tree method is better than both the binary and factor methods.

(Here the "better" method shows how to compute $x^n$ using fewer multiplications.)

8. [M21] Prove that the power tree (exercise 5) never gives more multiplications for
the computation of $x^n$ than the binary method.

9. [M21] Design an exponentiation procedure that is analogous to Algorithm A,
but based on a general radix $m > 2$. Your method should perform approximately
$\lg n + \nu + 2m$ multiplications, where $\nu$ is the number of nonzero digits in the $m$-ary
representation of $n$.

10. [10] Figure 14 shows a tree that indicates one way to compute $x^n$ with the fewest
possible multiplications, for all $n \le 100$. How can this tree be conveniently represented
within a computer, in just 100 memory locations?

► 11. [M26] The tree of Fig. 14 depicts addition chains $a_0$, $a_1$, ..., $a_r$ having $l(a_i) = i$
for all $i$ in the chain. Find all addition chains for $n$ that have this property, when
$n = 43$ and when $n = 77$. Show that any tree such as Fig. 14 must include either the
path 1, 2, 4, 8, 9, 17, 34, 43, 77 or the path 1, 2, 4, 8, 9, 17, 34, 68, 77.

12. [M10] Is it possible to extend the tree shown in Fig. 14 to an infinite tree that
yields a minimum-multiplication rule for computing $x^n$, for all positive integers $n$?

13. [M21] Find a star chain of length $A + 2$ for each of the four cases listed in
Theorem C. (Consequently Theorem C holds also with $l$ replaced by $l^*$.)

14. [M35] Complete the proof of Theorem C, by demonstrating that (a) step $r - 1$
is not a small step; and (b) $\lambda(a_{r-k})$ cannot be less than $m - 1$.

15. [M42] Write a computer program to extend Theorem C, characterizing all $n$ such
that $l(n) = \lambda(n) + 3$ and characterizing all $n$ such that $l^*(n) = \lambda(n) + 3$.

16. [HM15] Show that Theorem D is not trivially true just because of the binary
method; if $l^B(n)$ denotes the length of the addition chain for $n$ produced by the binary
S-and-X method, $l^B(n)/\lambda(n)$ does not approach a limit as $n \to \infty$.

17. [M25] Explain how to find the intervals $J_1$, ..., $J_h$ that are required in the proof
of Lemma P.

18. [HM24] Let $\beta$ be a positive constant. Show that there is a constant $\alpha < 2$ such
that

    $\sum \binom{m+s}{t+v} \binom{t+v}{v} \beta^{2v} \binom{(m+s)^2}{t} < \alpha^m$

for all large $m$, where the sum is over all $s$, $t$, $v$ satisfying (30).

19. [M23] A “multiset” is like a set, but it may contain identical elements repeated a 
finite number of times. If A and B are multisets, we define new multisets Al+JB, A\JB, 
and A(~\B in the following way: An element occurring exactly a times in A and b times 
in B occurs exactly a -f- b times in Al±) B, exactly max(a, b ) times in A[J B, and exactly 
min(a, 6) times in A fl B. (A “set” is a multiset that contains no elements more than 
once; if A and B are sets, so are A U B and A fl B, and the definitions given in this 
exercise agree with the customary definitions of set union and intersection.) 

a) The prime factorization of an integer $n > 0$ is a multiset N whose elements are
primes, where $\prod_{p \in N} p = n$. The fact that every positive integer can be uniquely
factored into primes gives us a one-to-one correspondence between the positive
integers and the finite multisets of prime numbers; for example, if $n = 2^2 \cdot 3^3 \cdot 17$,
the corresponding multiset is $N = \{2, 2, 3, 3, 3, 17\}$. If M and N are the multisets
corresponding respectively to m and n, what multisets correspond to gcd(m, n),
lcm(m, n), and mn?

b) Every monic polynomial f(z) over the complex numbers corresponds in a natural
way to the multiset F of its “roots”; we have $f(z) = \prod_{\zeta \in F} (z - \zeta)$. If f(z) and
g(z) are the polynomials corresponding to the finite multisets F and G of complex
numbers, what polynomials correspond to $F \uplus G$, $F \cup G$, and $F \cap G$?

c) Find as many interesting identities as you can that hold between multisets, with
respect to the three operations $\uplus$, $\cup$, $\cap$.

20. [M20] What are the sequences $S_i$ and $M_{ij}$ (0 < i < r, 0 < j < t) arising in
Hansen’s structural decomposition of star chains (a) of Type 3? (b) of Type 5? (The
six “types” are defined in the proof of Theorem B.)

► 21. [M25] (W. Hansen.) Let q be any positive integer. Find a value of n such that
$l(n) < l^*(n) - q$.

22. [M20] Prove that the addition chain constructed in the proof of Theorem F is an
$l^0$-chain.

23. [M20] Prove Brauer’s inequality (50). 

► 24. [M22] Generalize the proof of Theorem G to show that $l^0((B^n - 1)/(B - 1)) \le
(n - 1)\,l^0(B) + l^0(n)$, for any integer $B > 1$; and prove that $l(2^{mn} - 1) \le l(2^m - 1) +
mn - m + l^0(n)$.

25. [20] Let y be a fraction, 0 < y < 1, expressed in the binary number system
as $y = (.d_1 \ldots d_k)_2$. Design an algorithm to compute $x^y$ using the operations of
multiplication and square-root extraction.

► 26. [M24] Design an efficient algorithm that computes the nth Fibonacci number $F_n$,
modulo m, given large integers n and m.


► 27. [24] (E. G. Straus.) Find a way to compute a general monomial $x_1^{n_1} x_2^{n_2} \ldots x_m^{n_m}$
in at most $2\lambda(\max(n_1, n_2, \ldots, n_m)) + 2^m - m - 1$ multiplications.

28. [M33] (A. Schönhage.) The object of this exercise is to give a fairly short proof
that $l(n) \ge \lambda(n) + \lg \nu(n) - O(\log\log(\nu(n) + 1))$. (a) When $x = (x_k \ldots x_0 . x_{-1} \ldots)_2$ and
$y = (y_k \ldots y_0 . y_{-1} \ldots)_2$ are real numbers written in binary notation, let us write $x \subseteq y$
if $x_j \le y_j$ for all j. Give a simple rule for constructing the smallest number z with the
property that $x' \subseteq x$ and $y' \subseteq y$ implies $x' + y' \subseteq z$. Denoting this number by $x \nabla y$,
prove that $\nu(x \nabla y) \le \nu(x) + \nu(y)$. (b) Given any addition chain (11) with $r = l(n)$,
let the sequence $d_0, d_1, \ldots, d_r$ be defined as in (35), and define the sequence $A_0, A_1,
\ldots, A_r$ by the following rules: $A_0 = 1$; if $a_i = 2a_{i-1}$ then $A_i = 2A_{i-1}$; otherwise if
$a_i = a_j + a_k$ for some $0 \le k \le j < i$, then $A_i = A_{i-1} \nabla (A_{i-1}/2^{d_j - d_k})$. Prove that
this sequence “covers” the given chain, in the sense that $a_i \subseteq A_i$ for $0 \le i \le r$. (c) Let
$\delta$ be a positive integer (to be selected later). Call the nondoubling step $a_i = a_j + a_k$ a
“baby step” if $d_j - d_k \ge \delta$, otherwise call it a “close step.” Let $B_0 = 1$; $B_i = 2B_{i-1}$
if $a_i = 2a_{i-1}$; $B_i = B_{i-1} \nabla (B_{i-1}/2^{d_j - d_k})$ if $a_i = a_j + a_k$ is a baby step; and
$B_i = \rho(2B_{i-1})$ otherwise, where $\rho(x)$ is the least number y such that $x/2^e \subseteq y$ for
$0 \le e \le \delta$. Show that $A_i \subseteq B_i$ and $\nu(B_i) \le (1 + \delta c_i)2^{b_i}$ for $0 \le i \le r$, where $b_i$
and $c_i$ respectively denote the number of baby steps and close steps $\le i$. [Hint: Show
that the 1’s in $B_i$ appear in consecutive blocks of size $\ge 1 + \delta c_i$.] (d) We now have
$l(n) = r = b_r + c_r + d_r$ and $\nu(n) \le \nu(B_r) \le (1 + \delta c_r)2^{b_r}$. Explain how to choose $\delta$ in
order to obtain the inequality stated at the beginning of this exercise. [Hint: See (16),
and note that $n \le 2^r \alpha^{b_r}$ for some $\alpha < 1$ depending on $\delta$.]

29. [M49] Is $\nu(n) \le 2^{l(n) - \lambda(n)}$ for all positive integers n? (If so, we have the lower
bound $l(2^n - 1) \ge n - 1 + \lceil \lg n \rceil$; cf. (17) and (49).)

30. [20] An addition-subtraction chain has the rule $a_i = a_j \pm a_k$ in place of (2);
the imaginary computer described in the text has a new operation code, SUB. (This
corresponds in practice to evaluating $x^n$ using both multiplications and divisions.) Find
an addition-subtraction chain, for some n, that has fewer than l(n) steps.

31. [M46] (D. H. Lehmer.) Explore the problem of minimizing $\epsilon q + (r - q)$ in an
addition chain (1), where q is the number of “simple” steps in which $a_i = a_{i-1} + 1$,
given a small positive “weight” $\epsilon$. (This problem comes closer to reality for many
calculations of $x^n$, if multiplication by x is simpler than a general multiplication; cf.
Section 4.6.2.)

32. [M30] (A. C. Yao.) Let $l(n_1, \ldots, n_m)$ be the length of the shortest addition chain
that contains m given numbers $n_1 < \cdots < n_m$. Prove that $l(n_1, \ldots, n_m) \le \lambda(n_m) +
m\lambda(n_m)/\lambda\lambda(n_m) + O(\lambda(n_m)\lambda\lambda\lambda(n_m)/\lambda\lambda(n_m)^2)$, thereby generalizing (25).

33. [M47] What is the asymptotic value of $l(1, 4, 9, \ldots, m^2) - m$, as $m \to \infty$, in the
notation of exercise 32?

34. [M50] Is $l(2^n - 1) \le n - 1 + l(n)$ for all positive integers n? Does equality always
hold? Does $l(n) = l^0(n)$?

35. [M30] (A. C. Yao, F. F. Yao, R. L. Graham.) Associate the “cost” $a_j a_k$ with
each step $a_i = a_j + a_k$ of an addition chain (1). Show that the left-to-right binary
method yields a chain of minimum total cost, for all positive integers n.

36. [15] How many addition chains of length 9 have (52) as their reduced directed 
graph? 
37. [M23] The binary addition chain for $n = 2^{e_0} + \cdots + 2^{e_t}$, when $e_0 > \cdots > e_t \ge 0$,
is $1, 2, \ldots, 2^{e_0 - e_1}, 2^{e_0 - e_1} + 1, \ldots, 2^{e_0 - e_2} + 2^{e_1 - e_2}, 2^{e_0 - e_2} + 2^{e_1 - e_2} + 1, \ldots, n$. This
corresponds to the S-and-X method described at the beginning of this section, while
Algorithm A corresponds to the addition chain obtained by sorting the two sequences
$(1, 2, 4, \ldots, 2^{e_0})$ and $(2^{e_{t-1}} + 2^{e_t}, 2^{e_{t-2}} + 2^{e_{t-1}} + 2^{e_t}, \ldots, n)$ into increasing order.
Prove or disprove: Each of these addition chains is a dual of the other.

38. [M27] How many addition chains without useless steps are equivalent to each of
the addition chains discussed in exercise 37, when eo > ei + 

► 39. [M25] (J. Olivos, 1979.) Let $l([n_1, n_2, \ldots, n_m])$ be the minimum number of multiplications
needed to evaluate the monomial $x_1^{n_1} x_2^{n_2} \ldots x_m^{n_m}$ in the sense of exercise 27,
where each $n_i$ is a positive integer. Prove that this problem is equivalent to the problem
of exercise 32, by showing that $l([n_1, n_2, \ldots, n_m]) = l(n_1, n_2, \ldots, n_m) + m - 1$. [Hint:
Generalize the directed graph construction by considering graphs with more than one
source vertex.]

► 40. [M21] (J. Olivos.) Generalizing the factor method, prove that

$$l(m_1 n_1 + \cdots + m_t n_t) \le l(m_1, \ldots, m_t) + l(n_1, \ldots, n_t) + t - 1,$$

where $l(n_1, \ldots, n_t)$ is defined in exercise 32.

41. [M40] (P. Downey, B. Leong, R. Sethi.) Let G be a connected graph with n
vertices $\{1, \ldots, n\}$ and m edges, where the edges join $u_j$ to $v_j$ for $1 \le j \le m$. Prove
that $l(1, 2, \ldots, 2^{An}, 2^{Au_1} + 2^{Av_1} + 1, \ldots, 2^{Au_m} + 2^{Av_m} + 1) = An + m + k$ for all
sufficiently large A, where k is the minimum number of vertices in a vertex cover for G
(i.e., a set containing either $u_j$ or $v_j$ for $1 \le j \le m$).


4.6.4. Evaluation of Polynomials 

Now that we know efficient ways to evaluate the special polynomial $x^n$, let us
consider the general problem of computing an nth degree polynomial

$$u(x) = u_n x^n + u_{n-1}x^{n-1} + \cdots + u_1 x + u_0, \qquad u_n \ne 0, \tag{1}$$

for given values of x. This problem arises frequently in practice.

In the following discussion we shall concentrate on minimizing the number of 
operations required to evaluate polynomials by computer, blithely assuming that 
all arithmetic operations are exact. Polynomials are most commonly evaluated 
using floating point arithmetic, which is not exact, and different schemes for 
the evaluation will, in general, give different answers. A numerical analysis of 
the accuracy achieved depends on the coefficients of the particular polynomial 
being considered, and is beyond the scope of this book; the reader should be 
careful to investigate the accuracy of any calculations undertaken with floating 
point arithmetic. In most cases the methods we shall describe turn out to be 
reasonably satisfactory from a numerical standpoint, but many bad examples 
can also be given. [See Webb Miller, SIAM J. Computing 4 (1975), 105-107, 
for a survey of the literature on stability of fast polynomial evaluation, and for 
a demonstration that certain kinds of numerical stability cannot be guaranteed
for some families of high-speed algorithms.] 

A beginning programmer will often evaluate the polynomial (1) in a manner
corresponding directly to its conventional textbook form: First $u_n x^n$ is calculated,
then $u_{n-1}x^{n-1}$, ..., $u_1 x$, and finally all of the terms of (1) are added
together. But even if the efficient methods of Section 4.6.3 are used to evaluate
powers of x in this approach, the resulting calculation is needlessly slow unless
nearly all of the coefficients $u_k$ are zero. If the coefficients are all nonzero, an
obvious alternative would be to evaluate (1) from right to left, computing the
values of $x^k$ and $u_k x^k + \cdots + u_0$ for $k = 1, \ldots, n$. Such a process involves $2n - 1$
multiplications and n additions, and it might also require further instructions to
store and retrieve intermediate results from memory.

Horner’s rule. One of the first things a novice programmer is usually taught is
an elegant way to rearrange this computation, by evaluating u(x) as follows:

$$u(x) = (\ldots(u_n x + u_{n-1})x + \cdots)x + u_0. \tag{2}$$

Start with $u_n$, multiply by x, add $u_{n-1}$, multiply by x, ..., multiply by x,
add $u_0$. This form of the computation is usually called “Horner’s rule”; we have
already seen it used in connection with radix conversion in Section 4.4. The
entire process requires n multiplications and n additions, minus one addition
for each coefficient that is zero. Furthermore, there is no need to store partial
results, since each quantity arising during the calculation is used immediately
after it has been computed.
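The two approaches just described can be compared with a short Python sketch. It is only an illustration: the function names and the convention that `u[k]` holds the coefficient of $x^k$ are assumptions made here, and the operation counters simply tally what the loops do (the Horner counter does not subtract additions for zero coefficients).

```python
def eval_right_to_left(u, x):
    """Right-to-left evaluation: build x^k and the partial sum together.

    u[k] is the coefficient of x^k (an assumed convention).
    Returns (value, multiplications, additions).
    """
    n = len(u) - 1
    total, power = u[0], 1
    mults = adds = 0
    for k in range(1, n + 1):
        if k == 1:
            power = x              # x^1 needs no multiplication
        else:
            power = power * x      # one multiplication for x^k
            mults += 1
        total = total + u[k] * power   # one multiplication, one addition
        mults += 1
        adds += 1
    return total, mults, adds


def horner(u, x):
    """Horner's rule (2). Returns (value, multiplications, additions)."""
    n = len(u) - 1
    acc = u[n]
    mults = adds = 0
    for k in range(n - 1, -1, -1):
        acc = acc * x + u[k]
        mults += 1
        adds += 1
    return acc, mults, adds
```

For a polynomial of degree 4 the first function reports 7 multiplications and 4 additions, the second 4 and 4, in agreement with the counts $2n - 1$ and n stated above.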

W. G. Horner gave this rule early in the nineteenth century [Philosophical 
Transactions, Royal Society of London 109 (1819), 308-335] in connection with 
a procedure for calculating polynomial roots. The fame of the latter method 
[see J. L. Coolidge, Mathematics of Great Amateurs (Oxford, 1949), Chapter 15] 
accounts for the fact that Horner’s name has been attached to (2); but actually 
Isaac Newton had made use of the same idea 150 years earlier. In a well-known
work entitled De Analysi per Æquationes Infinitas, originally written in 1669,
Newton wrote

$$\overline{\overline{\overline{y - 4} \times y : + 5} \times y : - 12} \times y : + 17$$

for the polynomial $y^4 - 4y^3 + 5y^2 - 12y + 17$; this clearly uses the idea of
(2), since he often denoted grouping by using horizontal lines and colons instead
of parentheses. [See D. T. Whiteside, ed., The Mathematical Papers of Isaac
Newton 2 (Cambridge Univ. Press, 1968), 222.]

Several generalizations of Horner’s rule have been suggested. Let us first
consider evaluating u(z) when z is a complex number, while the coefficients $u_k$
are real. In particular, when $z = e^{i\theta} = \cos\theta + i\sin\theta$, the polynomial u(z) is
essentially two Fourier series,

$$(u_0 + u_1\cos\theta + \cdots + u_n\cos n\theta) + i(u_1\sin\theta + \cdots + u_n\sin n\theta).$$
Complex addition and multiplication can obviously be reduced to a sequence of
ordinary operations on real numbers:

    real + complex        requires    1 addition
    complex + complex     requires    2 additions
    real × complex        requires    2 multiplications
    complex × complex     requires    4 multiplications, 2 additions
                          or          3 multiplications, 5 additions


(See exercise 41. Subtraction is here considered as if it were equivalent to
addition.) Therefore Horner’s rule (2) uses either $4n - 2$ multiplications and
$3n - 2$ additions or $3n - 1$ multiplications and $6n - 5$ additions to evaluate
u(z) when $z = x + iy$ is complex. Actually $2n - 4$ of these additions can be
saved, since we are multiplying by the same number z each time. An alternative
procedure for evaluating $u(x + iy)$ is to let

$$a_1 = u_n, \quad b_1 = u_{n-1}, \quad r = x + x, \quad s = x^2 + y^2;$$
$$a_j = b_{j-1} + r a_{j-1}, \quad b_j = u_{n-j} - s a_{j-1}, \quad 1 < j \le n. \tag{3}$$

Then it is easy to prove by induction that $u(z) = z a_n + b_n$. This scheme [BIT
5 (1965), 142; cf. also G. Goertzel, AMM 65 (1958), 34-35] requires only $2n + 2$
multiplications and $2n + 1$ additions, so it is an improvement over Horner’s rule
when $n \ge 3$. In the case of Fourier series, when $z = e^{i\theta}$, we have $s = 1$, so
the number of multiplications drops to $n + 1$. The moral of this story is that
a good programmer does not make indiscriminate use of the built-in “complex
arithmetic” features of high-level programming languages.
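Rule (3) translates almost line for line into code that touches only real numbers. The following Python sketch is illustrative; the function name and the list convention (`u[k]` is the real coefficient of $z^k$) are assumptions, and the degree-0 guard is added here because (3) presupposes $n \ge 1$.

```python
def eval_at_complex(u, x, y):
    """Evaluate u(z) at z = x + iy with real arithmetic only, following (3).

    u[k] is the real coefficient of z^k. Returns (real part, imaginary part).
    """
    n = len(u) - 1
    if n == 0:
        return u[0], 0
    r = x + x
    s = x * x + y * y
    a, b = u[n], u[n - 1]                    # a_1 and b_1
    for j in range(2, n + 1):
        # The right-hand side uses the old a (that is, a_{j-1}) in both places.
        a, b = b + r * a, u[n - j] - s * a
    # u(z) = z * a_n + b_n, with a_n and b_n real
    return x * a + b, y * a
```

For example, with $u(z) = 3z^2 + 2z + 1$ and $z = 1 + 2i$ the loop runs once, giving $a_2 = 8$ and $b_2 = -14$, hence $u(z) = 8z - 14 = -6 + 16i$.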

Consider the process of dividing the polynomial u(x) by $x - x_0$, using
Algorithm 4.6.1D to obtain $u(x) = (x - x_0)q(x) + r(x)$; here deg(r) < 1, so r(x)
is a constant independent of x, and $u(x_0) = 0 \cdot q(x_0) + r = r$. An examination
of this division process reveals that the computation is essentially the same as
Horner’s rule for evaluating $u(x_0)$. Similarly, if we divide u(z) by the polynomial
$(z - z_0)(z - \bar z_0) = z^2 - 2x_0 z + x_0^2 + y_0^2$, the resulting computation turns out to
be equivalent to (3); we obtain $u(z) = (z - z_0)(z - \bar z_0)q(z) + a_n z + b_n$, hence
$u(z_0) = a_n z_0 + b_n$.

In general, if we divide u(x) by f(x) to obtain $u(x) = f(x)q(x) + r(x)$,
and if $f(x_0) = 0$, we have $u(x_0) = r(x_0)$; this observation leads to further
generalizations of Horner’s rule. For example, we may let $f(x) = x^2 - x_0^2$; this
yields the “second-order” Horner’s rule

$$u(x) = (\ldots(u_{2\lfloor n/2\rfloor}x^2 + u_{2\lfloor n/2\rfloor - 2})x^2 + \cdots)x^2 + u_0$$
$$\qquad {} + ((\ldots(u_{2\lceil n/2\rceil - 1}x^2 + u_{2\lceil n/2\rceil - 3})x^2 + \cdots)x^2 + u_1)x. \tag{4}$$

The second-order rule uses $n + 1$ multiplications and n additions (see exercise 5);
so it is no improvement over Horner’s rule from this standpoint. But there are at
least two circumstances in which (4) is useful: If we want to evaluate both u(x)
and u(−x), this approach yields u(−x) with just one more addition operation;
two values can be obtained almost as cheaply as one. Moreover, if we have a
computer that allows parallel computations, the two lines of (4) may be evaluated
independently, so we save about half the running time.
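A small Python sketch shows how the two lines of (4) deliver u(x) and u(−x) together. The function name and coefficient layout are assumptions made for illustration; for brevity both accumulators start at zero, which costs one redundant multiplication per line compared with the operation count quoted above.

```python
def horner_even_odd(u, x):
    """Second-order Horner's rule (4): returns (u(x), u(-x)).

    u[k] is the coefficient of x^k. The even-indexed and odd-indexed
    coefficients are accumulated separately as polynomials in x^2.
    """
    x2 = x * x
    even = 0
    odd = 0
    for k in range(len(u) - 1, -1, -1):
        if k % 2 == 0:
            even = even * x2 + u[k]
        else:
            odd = odd * x2 + u[k]
    odd_term = odd * x
    # u(x) and u(-x) differ only in the sign of the odd part.
    return even + odd_term, even - odd_term
```

The two accumulators never read each other, which is what makes the two lines of (4) suitable for independent (parallel) evaluation.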

When our computer allows parallel computation on k arithmetic units at
once, a “kth-order” Horner’s rule (obtained in a similar manner from $f(x) =
x^k - x_0^k$) may be used. Another attractive method for parallel computation has
been suggested by G. Estrin [Proc. Western Joint Computing Conf. 17 (1960),
33-40]; for n = 7, Estrin’s method is:

    Processor 1              Processor 2            Processor 3              Processor 4            Processor 5
    $a_1 = u_7 x + u_6$      $b_1 = u_5 x + u_4$    $c_1 = u_3 x + u_2$      $d_1 = u_1 x + u_0$    $x^2$
    $a_2 = a_1 x^2 + b_1$                           $c_2 = c_1 x^2 + d_1$                           $x^4$
    $a_3 = a_2 x^4 + c_2$

Here $a_3 = u(x)$. However, an interesting analysis by W. S. Dorn [IBM J. Res.
and Devel. 6 (1962), 239-245] shows that these methods might not actually be
an improvement over the second-order rule, if each arithmetic unit must access
a memory that communicates with only one processor at a time.
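The table for n = 7 suggests a general pattern: combine neighbouring coefficients in pairs, square the power of x, and repeat. The Python sketch below simulates that pattern sequentially; the generalization to arbitrary n by zero-padding is an assumption made here, not something the text spells out, and the name `estrin` is ours.

```python
def estrin(u, x):
    """Sequential simulation of Estrin's pairing scheme.

    u[k] is the coefficient of x^k. Every entry of the list built in one
    pass depends only on the previous pass, so the entries of a pass could
    be computed on separate processors.
    """
    terms = list(u)
    power = x
    while len(terms) > 1:
        if len(terms) % 2:
            terms.append(0)          # pad so that the terms pair up
        terms = [terms[i] + terms[i + 1] * power
                 for i in range(0, len(terms), 2)]
        power = power * power        # x, x^2, x^4, ...
    return terms[0]
```

For eight coefficients the three passes produce exactly the quantities $(d_1, c_1, b_1, a_1)$, then $(c_2, a_2)$, then $a_3$ of the table.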

Tabulating polynomial values. If we wish to evaluate an nth degree polynomial
at many points in an arithmetic progression (i.e., if we want to calculate $u(x_0)$,
$u(x_0 + h)$, $u(x_0 + 2h)$, ...), the process can be reduced to addition only, after
the first few steps. For if we start with any sequence of numbers $(\alpha_0, \alpha_1, \ldots, \alpha_n)$
and apply the transformation

$$\alpha_0 \leftarrow \alpha_0 + \alpha_1, \quad \alpha_1 \leftarrow \alpha_1 + \alpha_2, \quad \ldots, \quad \alpha_{n-1} \leftarrow \alpha_{n-1} + \alpha_n, \tag{5}$$

we find that k applications of (5) yield

$$\alpha_j^{(k)} = \binom{k}{0}\beta_j + \binom{k}{1}\beta_{j+1} + \binom{k}{2}\beta_{j+2} + \cdots, \qquad 0 \le j \le n,$$

where $\beta_j$ denotes the initial value of $\alpha_j$ and $\beta_j = 0$ for $j > n$. In particular,

$$\alpha_0^{(k)} = \binom{k}{0}\beta_0 + \binom{k}{1}\beta_1 + \cdots + \binom{k}{n}\beta_n \tag{6}$$

is a polynomial of degree n in k. By properly choosing the β’s, as shown in
exercise 7, we can arrange things so that $\alpha_0^{(k)}$ is the desired value $u(x_0 + kh)$,
for all k. In other words, each execution of the n additions in (5) will produce
the next value of the given polynomial.

Caution: Rounding errors can accumulate after many repetitions of (5), and
an error in $\alpha_j$ produces a corresponding error in the coefficients of $x^0$, ..., $x^j$
in the polynomial being computed. Therefore the values of the α’s should be
“refreshed” after a large number of iterations.
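As a hedged illustration of (5), the following Python sketch tabulates a polynomial using additions only after a start-up phase. The text defers the choice of the β’s to exercise 7; here we assume the standard choice suggested by (6), namely that $\beta_j$ is the jth forward difference of u at $x_0$ with step h. The function names are ours.

```python
def initial_differences(u, x0, h):
    """Assumed start-up: beta_j = j-th forward difference of u at x0."""
    n = len(u) - 1

    def value(x):                      # plain Horner evaluation
        acc = 0
        for c in reversed(u):
            acc = acc * x + c
        return acc

    vals = [value(x0 + k * h) for k in range(n + 1)]
    betas = []
    for _ in range(n + 1):
        betas.append(vals[0])
        vals = [vals[i + 1] - vals[i] for i in range(len(vals) - 1)]
    return betas


def tabulate(u, x0, h, count):
    """Return [u(x0), u(x0 + h), ..., u(x0 + (count - 1) h)]."""
    alpha = initial_differences(u, x0, h)
    n = len(alpha) - 1
    out = []
    for _ in range(count):
        out.append(alpha[0])
        for j in range(n):             # transformation (5): n additions
            alpha[j] = alpha[j] + alpha[j + 1]
    return out
```

With exact (integer or rational) inputs no rounding occurs; with floating point inputs the caution above applies, and one would recompute the α’s from time to time.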
Derivatives and changes of variable. Sometimes we want to find the coefficients
of $u(x + x_0)$, given a constant $x_0$ and the coefficients of u(x). For example, if
$u(x) = 3x^2 + 2x - 1$, then $u(x - 2) = 3x^2 - 10x + 7$. This is analogous to a
radix conversion problem, converting from base x to base $x + 2$.

By Taylor’s theorem, the desired coefficients are given by the derivatives of
u(x) at $x = x_0$, namely

$$u(x + x_0) = u(x_0) + u'(x_0)x + (u''(x_0)/2!)x^2 + \cdots + (u^{(n)}(x_0)/n!)x^n, \tag{7}$$

so the problem is equivalent to evaluating u(x) and all its derivatives.

If we write $u(x) = q(x)(x - x_0) + r$, then $u(x + x_0) = q(x + x_0)x + r$;
so r is the constant coefficient of $u(x + x_0)$, and the problem reduces to finding
the coefficients of $q(x + x_0)$, where q(x) is a known polynomial of degree $n - 1$.
Thus the following algorithm is indicated:

H1. Set $v_j \leftarrow u_j$ for $0 \le j \le n$.

H2. For $k = 0, 1, \ldots, n - 1$ (in this order), set $v_j \leftarrow v_j + x_0 v_{j+1}$ for $j = n - 1$,
..., $k + 1$, $k$ (in this order). ∎

At the conclusion of step H2 we have $u(x + x_0) = v_n x^n + \cdots + v_1 x + v_0$. This
procedure was a principal part of Horner’s root-finding method, and when $k = 0$
it is exactly rule (2) for evaluating $u(x_0)$.
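Steps H1 and H2 can be sketched in a few lines of Python. The function name and the list layout (`u[j]` is the coefficient of $x^j$) are assumptions for illustration only.

```python
def taylor_shift(u, x0):
    """Coefficients of u(x + x0), following steps H1 and H2."""
    v = list(u)                              # H1
    n = len(v) - 1
    for k in range(n):                       # H2: k = 0, 1, ..., n-1
        for j in range(n - 1, k - 1, -1):    # j = n-1, ..., k+1, k
            v[j] = v[j] + x0 * v[j + 1]
    return v
```

Applied to the example in the text, `taylor_shift([-1, 2, 3], -2)` yields the coefficients of $3x^2 - 10x + 7$; after the first pass (k = 0) the entry `v[0]` already equals $u(x_0)$.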

Horner’s method requires $(n^2 + n)/2$ multiplications and $(n^2 + n)/2$ additions;
but notice that if $x_0 = 1$ we avoid all of the multiplications. Fortunately
we can reduce the general problem to the case $x_0 = 1$ by introducing comparatively
few multiplications and divisions:

S1. Compute and store the values $x_0^2, \ldots, x_0^n$.

S2. Set $v_j \leftarrow u_j x_0^j$ for $0 \le j \le n$. (Now $v(x) = u(x_0 x)$.)

S3. Perform step H2 but with $x_0 = 1$. (Now $v(x) = u(x_0(x + 1)) = u(x_0 x + x_0)$.)

S4. Set $v_j \leftarrow v_j / x_0^j$ for $0 \le j \le n$. (Now $v(x) = u(x + x_0)$ as desired.) ∎

This idea, due to M. Shaw and J. F. Traub [JACM 21 (1974), 161-167], has the
same number of additions and the same numerical stability as Horner’s method,
but it needs only $2n - 1$ multiplications and $n - 1$ divisions. About $\frac{1}{2}n$ of these
multiplications can, in turn, be avoided (see exercise 6).
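The Shaw and Traub reduction can be sketched as follows. This is an illustrative Python version, not a tuned implementation: exact rational arithmetic (`fractions.Fraction`) is an assumption chosen so that the divisions of step S4 are exact, $x_0$ must be nonzero, and the function name is ours.

```python
from fractions import Fraction


def taylor_shift_scaled(u, x0):
    """Coefficients of u(x + x0) via steps S1 to S4 (requires x0 != 0)."""
    n = len(u) - 1
    x0 = Fraction(x0)
    powers = [Fraction(1)]
    for _ in range(n):                         # S1: powers of x0
        powers.append(powers[-1] * x0)
    v = [u[j] * powers[j] for j in range(n + 1)]   # S2: v(x) = u(x0 x)
    for k in range(n):                         # S3: step H2 with x0 = 1,
        for j in range(n - 1, k - 1, -1):      #     additions only
            v[j] = v[j] + v[j + 1]
    return [v[j] / powers[j] for j in range(n + 1)]  # S4
```

The double loop of S3 contains no multiplications at all, which is the whole point of the rescaling; this sketch does not bother to skip the trivial operations for j = 0 and j = n.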

If we want only the first few or the last few derivatives, Shaw and Traub
have observed that there are further ways to save time. For example, if we just
want to evaluate u(x) and u′(x), we can do the job with $2n - 1$ additions and
about $n + \sqrt{2n}$ multiplications/divisions as follows:

D1. Compute and store the values $x^2, x^3, \ldots, x^t, x^{2t}$, where $t = \lceil\sqrt{n/2}\,\rceil$.

D2. Set $v_j \leftarrow u_j x^{f(j)}$ for $0 \le j \le n$, where $f(j) = t - 1 - ((n - 1 - j) \bmod 2t)$
for $0 \le j < n$, and $f(n) = t$.

D3. Set $v_j \leftarrow v_j + v_{j+1} x^{g(j)}$ for $j = n - 1, \ldots, 1, 0$; here $g(j) = 2t$ when $n - 1 - j$
is a positive multiple of 2t, otherwise $g(j) = 0$ and the multiplication by $x^{g(j)}$ need
not be done.

D4. Set $v_j \leftarrow v_j + v_{j+1} x^{g(j)}$ for $j = n - 1, \ldots, 2, 1$. Now $v_0/x^{f(0)} = u(x)$ and
$v_1/x^{f(1)} = u'(x)$. ∎
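The bookkeeping in steps D1 to D4 is easy to get wrong, so a direct transcription is a useful check. The Python sketch below rests on two assumptions: it takes g(j) = 2t exactly when $n - 1 - j$ is a positive multiple of 2t (the choice that makes $f(j + 1) + g(j) = f(j) + 1$ hold for every j), and it uses exact rationals with the `**` operator instead of the stored table of step D1, so it demonstrates correctness rather than the operation count. It requires $n \ge 1$ and $x \ne 0$; the names are ours.

```python
from fractions import Fraction


def value_and_derivative(u, x):
    """Return (u(x), u'(x)) following steps D2, D3, D4."""
    n = len(u) - 1
    x = Fraction(x)
    t = 1
    while 2 * t * t < n:               # t = ceil(sqrt(n / 2))
        t += 1

    def f(j):
        return t if j == n else t - 1 - ((n - 1 - j) % (2 * t))

    def g(j):
        d = n - 1 - j
        return 2 * t if d > 0 and d % (2 * t) == 0 else 0

    v = [u[j] * x ** f(j) for j in range(n + 1)]    # D2 (f(j) may be negative)
    for j in range(n - 1, -1, -1):                  # D3
        v[j] = v[j] + v[j + 1] * x ** g(j)
    for j in range(n - 1, 0, -1):                   # D4
        v[j] = v[j] + v[j + 1] * x ** g(j)
    return v[0] / x ** f(0), v[1] / x ** f(1)
```

The invariant behind the scheme is that after D3 each $v_j$ equals $x^{f(j)}$ times the Horner partial sum $u_n x^{n-j} + \cdots + u_j$; D4 repeats the same recurrence on those partial sums, which yields the derivative.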
Adaptation of coefficients. Let us now return to our original problem of evaluating
a given polynomial u(x) as rapidly as possible, for “random” values of x.
The importance of this problem is due partly to the fact that standard functions
such as $\sin x$, $\cos x$, $e^x$, etc., are usually computed by subroutines that rely on
the evaluation of certain polynomials; such polynomials are evaluated so often,
it is desirable to find the fastest possible way to do the computation.

Arbitrary polynomials of degree five and higher can be evaluated with fewer
operations than Horner’s rule requires, if we first “adapt” or “precondition”
the coefficients $u_0, u_1, \ldots, u_n$. This adaptation process might involve a lot of
work, as explained below; but the preliminary calculation is not wasted, since
it must be done only once while the polynomial will be evaluated many times.
For examples of “adapted” polynomials for standard functions, see V. Ya. Pan,
USSR Computational Math. and Math. Physics 2 (1963), 137-146.

The simplest case for which adaptation of coefficients is helpful occurs for a
fourth degree polynomial:

$$u(x) = u_4 x^4 + u_3 x^3 + u_2 x^2 + u_1 x + u_0, \qquad u_4 \ne 0. \tag{8}$$

This equation can be rewritten in a form originally suggested by T. S. Motzkin,

$$y = (x + \alpha_0)x + \alpha_1, \qquad u(x) = ((y + x + \alpha_2)y + \alpha_3)\alpha_4, \tag{9}$$

for suitably “adapted” coefficients $\alpha_0, \alpha_1, \alpha_2, \alpha_3, \alpha_4$. The computation in (9)
involves three multiplications, five additions, and (on a one-accumulator machine
like MIX) one instruction to store the partial result y into temporary storage. By
comparison with Horner’s rule, we have traded a multiplication for an addition
and a possible storage command. Even this comparatively small savings is
worthwhile if the polynomial is to be evaluated often. (Of course, if the time for
multiplication is comparable to the time for addition, (9) gives no improvement;
we will see that a general fourth-degree polynomial always requires at least eight
arithmetic operations for its evaluation.)

By equating coefficients in (8) and (9), we obtain formulas for computing the
$\alpha_j$’s in terms of the $u_k$’s:

$$\alpha_0 = \tfrac{1}{2}(u_3/u_4 - 1), \quad \beta = u_2/u_4 - \alpha_0(\alpha_0 + 1), \quad \alpha_1 = u_1/u_4 - \alpha_0\beta,$$
$$\alpha_2 = \beta - 2\alpha_1, \quad \alpha_3 = u_0/u_4 - \alpha_1(\alpha_1 + \alpha_2), \quad \alpha_4 = u_4. \tag{10}$$

A similar scheme, which evaluates a fourth-degree polynomial in the same number
of steps as (9), appears in exercise 18; this alternative method will give greater
numerical accuracy than (9) in certain cases, although it yields poorer accuracy
in others.
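Formulas (10) and scheme (9) can be checked with a short Python sketch. Exact rationals are used here (an assumption, made so that the comparison with Horner’s rule is exact rather than approximate), and the function names are ours.

```python
from fractions import Fraction


def adapt4(u):
    """Adapted coefficients (10) for u = [u0, u1, u2, u3, u4], u4 != 0."""
    u0, u1, u2, u3, u4 = (Fraction(c) for c in u)
    a0 = (u3 / u4 - 1) / 2
    beta = u2 / u4 - a0 * (a0 + 1)
    a1 = u1 / u4 - a0 * beta
    a2 = beta - 2 * a1
    a3 = u0 / u4 - a1 * (a1 + a2)
    a4 = u4
    return a0, a1, a2, a3, a4


def eval4(alpha, x):
    """Scheme (9): three multiplications and five additions."""
    a0, a1, a2, a3, a4 = alpha
    y = (x + a0) * x + a1
    return ((y + x + a2) * y + a3) * a4
```

The adaptation is done once per polynomial; `eval4` is then the routine that would be called many times.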

Polynomials that arise in practice often have a rather small leading coefficient,
so that the division by $u_4$ in (10) leads to instability. In such a case it
is usually preferable to replace x by $|u_4|^{-1/4}x$ as the first step, reducing (8) to a
polynomial whose leading coefficient is ±1. A similar transformation applies to
polynomials of higher degrees. This idea is due to C. T. Fike [CACM 10 (1967),
175-178], who has presented several interesting examples.

Any polynomial of the fifth degree may be evaluated using four multiplications,
six additions, and one storing, by using the rule $u(x) = U(x)x + u_0$, where
$U(x) = u_5 x^4 + u_4 x^3 + u_3 x^2 + u_2 x + u_1$ is evaluated as in (9). Alternatively,
we can do the evaluation with four multiplications, five additions, and three
storings, if the calculations take the form

$$y = (x + \alpha_0)^2, \qquad u(x) = (((y + \alpha_1)y + \alpha_2)(x + \alpha_3) + \alpha_4)\alpha_5. \tag{11}$$

The determination of the α’s this time requires the solution of a cubic equation
(see exercise 19).

On many computers the number of “storing” operations required by (11)
is less than 3; for example, we may be able to compute $(x + \alpha_0)^2$ without
storing $x + \alpha_0$. In fact, many computers have more than one arithmetic register
for floating point calculations, so we can avoid storing altogether. Because of 
the wide variety of features available for arithmetic on different computers, we 
shall henceforth in this section count only the arithmetic operations, not the 
operations of storing and loading an accumulator. The computation schemes 
can usually be adapted to any particular computer in a straightforward manner, 
so that very few of these auxiliary operations are necessary; on the other hand, 
it must be remembered that this extra overhead might well overshadow the fact 
that we are saving a multiplication or two, especially if the machine code is being 
produced by a compiler that does not “optimize.” 

A polynomial $u(x) = u_6 x^6 + \cdots + u_1 x + u_0$ of degree six can always be
evaluated using four multiplications and seven additions, with the scheme

$$z = (x + \alpha_0)x + \alpha_1, \qquad w = (x + \alpha_2)z + \alpha_3,$$
$$u(x) = ((w + z + \alpha_4)w + \alpha_5)\alpha_6. \tag{12}$$

[See D. E. Knuth, CACM 5 (1962), 595-599.] This saves two of the six multiplications
required by Horner’s rule. Here again we must solve a cubic equation:
Since $\alpha_6 = u_6$, we may assume that $u_6 = 1$. Under this assumption, let

$$\beta_1 = \tfrac{1}{2}(u_5 - 1), \quad \beta_2 = u_4 - \beta_1(\beta_1 + 1), \quad \beta_3 = u_3 - \beta_1\beta_2, \quad \beta_4 = \beta_1 - \beta_2, \quad \beta_5 = u_2 - \beta_1\beta_3.$$

Let $\beta_6$ be a real root of the cubic equation

$$2y^3 + (2\beta_4 - \beta_2 + 1)y^2 + (2\beta_5 - \beta_2\beta_4 - \beta_3)y + (u_1 - \beta_2\beta_5) = 0. \tag{13}$$

(This equation always has a real root, since the polynomial on the left approaches
$+\infty$ for large positive y, and it approaches $-\infty$ for large negative y; it must
assume the value zero somewhere in between.) Now if we define

$$\beta_7 = \beta_6^2 + \beta_4\beta_6 + \beta_5, \qquad \beta_8 = \beta_3 - \beta_6 - \beta_7,$$

we have finally

$$\alpha_0 = \beta_2 - 2\beta_6, \quad \alpha_2 = \beta_1 - \alpha_0, \quad \alpha_1 = \beta_6 - \alpha_0\alpha_2,$$
$$\alpha_3 = \beta_7 - \alpha_1\alpha_2, \quad \alpha_4 = \beta_8 - \beta_7 - \alpha_1, \quad \alpha_5 = u_0 - \beta_7\beta_8. \tag{14}$$
We can illustrate this procedure with a contrived example: Suppose that we
want to evaluate $x^6 + 13x^5 + 49x^4 + 33x^3 - 61x^2 - 37x + 3$. We obtain $\alpha_6 = 1$,
$\beta_1 = 6$, $\beta_2 = 7$, $\beta_3 = -9$, $\beta_4 = -1$, $\beta_5 = -7$, and so we meet with the cubic
equation

$$2y^3 - 8y^2 + 2y + 12 = 0. \tag{15}$$

This equation has $\beta_6 = 2$ as a root, and we continue to find $\beta_7 = -5$, $\beta_8 = -6$,
$\alpha_0 = 3$, $\alpha_2 = 3$, $\alpha_1 = -7$, $\alpha_3 = 16$, $\alpha_4 = 6$, $\alpha_5 = -27$. The resulting scheme
is therefore

$$z = (x + 3)x - 7, \qquad w = (x + 3)z + 16, \qquad u(x) = (w + z + 6)w - 27.$$

By sheer coincidence the quantity $x + 3$ appears twice here, so we have found a
method that uses three multiplications and six additions.
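The worked example can be replayed mechanically. In the Python sketch below the caller supplies a real root $\beta_6$ of the cubic (13); how that root is found (closed form, bisection, a library routine) is deliberately left open. Exact rationals and all function names are assumptions made for illustration.

```python
from fractions import Fraction


def betas_and_cubic(u):
    """u = [u0, ..., u6]. Returns (beta_1..beta_5) and the coefficients of
    the cubic (13), highest power first, after dividing through by u6."""
    u6 = Fraction(u[6])
    u0, u1, u2, u3, u4, u5 = (Fraction(c) / u6 for c in u[:6])
    b1 = (u5 - 1) / 2
    b2 = u4 - b1 * (b1 + 1)
    b3 = u3 - b1 * b2
    b4 = b1 - b2
    b5 = u2 - b1 * b3
    cubic = (2, 2 * b4 - b2 + 1, 2 * b5 - b2 * b4 - b3, u1 - b2 * b5)
    return (b1, b2, b3, b4, b5), cubic


def adapt6(u, b6):
    """Formulas (14), given a root b6 of the cubic (13)."""
    (b1, b2, b3, b4, b5), _ = betas_and_cubic(u)
    u0 = Fraction(u[0]) / Fraction(u[6])
    b7 = b6 * b6 + b4 * b6 + b5
    b8 = b3 - b6 - b7
    a0 = b2 - 2 * b6
    a2 = b1 - a0
    a1 = b6 - a0 * a2
    a3 = b7 - a1 * a2
    a4 = b8 - b7 - a1
    a5 = u0 - b7 * b8
    return a0, a1, a2, a3, a4, a5, Fraction(u[6])


def eval6(alpha, x):
    """Scheme (12): four multiplications and seven additions."""
    a0, a1, a2, a3, a4, a5, a6 = alpha
    z = (x + a0) * x + a1
    w = (x + a2) * z + a3
    return ((w + z + a4) * w + a5) * a6
```

For the example polynomial the cubic (15) factors as $2(y - 2)(y - 3)(y + 1)$, so $\beta_6 = 3$ or $\beta_6 = -1$ could be used as well; each root gives a different but equally valid set of α’s.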

Another method for handling sixth-degree equations has been suggested by
V. Ya. Pan [Problemy Kibernetiki 5 (1961), 17-29]. His method requires one more
addition operation, but it involves only rational operations in the preliminary
steps; no cubic equation needs to be solved. We may proceed as follows:

$$z = (x + \alpha_0)x + \alpha_1, \qquad w = z + x + \alpha_2,$$
$$u(x) = (((z - x + \alpha_3)w + \alpha_4)z + \alpha_5)\alpha_6. \tag{16}$$

To determine the α’s, we divide the polynomial once again by $u_6 = \alpha_6$ so that
u(x) becomes monic. It can then be verified that $\alpha_0 = u_5/3$ and that

$$\alpha_1 = (u_1 - \alpha_0 u_2 + \alpha_0^2 u_3 - \alpha_0^3 u_4 + 2\alpha_0^5)/(u_3 - 2\alpha_0 u_4 + 5\alpha_0^3). \tag{17}$$

Note that Pan’s method requires that the denominator in (17) does not vanish.
In other words, (16) can be used only when

$$27u_3u_6^2 - 18u_6u_5u_4 + 5u_5^3 \ne 0; \tag{18}$$

in fact, this quantity should not be so small that $\alpha_1$ becomes too large. Once $\alpha_1$
has been determined, the remaining α’s may be determined from the equations

$$\beta_1 = 2\alpha_0, \qquad \beta_2 = u_4 - \alpha_0\beta_1 - \alpha_1,$$
$$\beta_3 = u_3 - \alpha_0\beta_2 - \alpha_1\beta_1, \qquad \beta_4 = u_2 - \alpha_0\beta_3 - \alpha_1\beta_2,$$
$$\alpha_3 = \tfrac{1}{2}\bigl(\beta_3 - (\alpha_0 - 1)\beta_2 + (\alpha_0 - 1)(\alpha_0^2 - 1)\bigr) - \alpha_1,$$
$$\alpha_2 = \beta_2 - (\alpha_0^2 - 1) - \alpha_3 - 2\alpha_1,$$
$$\alpha_4 = \beta_4 - (\alpha_2 + \alpha_1)(\alpha_3 + \alpha_1),$$
$$\alpha_5 = u_0 - \alpha_1\beta_4. \tag{19}$$
We have discussed the cases of degree n = 4, 5, 6 in detail because the
smaller values of n arise most frequently in applications. Let us now consider
a general evaluation scheme for nth degree polynomials, a method that involves
at most $\lfloor n/2 \rfloor + 2$ multiplications and n additions.
Theorem E. Every nth degree polynomial (1) with real coefficients, $n \ge 3$, can
be evaluated by the scheme

$$y = x + c, \qquad w = y^2;$$
$$z = (u_n y + \alpha_0)y + \beta_0 \ \ (n \text{ even}), \qquad z = u_n y + \beta_0 \ \ (n \text{ odd});$$
$$u(x) = (\ldots((z(w - \alpha_1) + \beta_1)(w - \alpha_2) + \beta_2)\ldots)(w - \alpha_m) + \beta_m; \tag{20}$$

for suitable real parameters c, $\alpha_k$ and $\beta_k$, where $m = \lceil n/2 \rceil - 1$. In fact, it is
possible to select these parameters so that $\beta_m = 0$.

Proof. Let us first examine the circumstances under which the α’s and β’s can
be chosen in (20), if c is fixed. Let

$$p(x) = u(x - c) = a_n x^n + a_{n-1}x^{n-1} + \cdots + a_1 x + a_0. \tag{21}$$

We want to show that p(x) has the form $p_1(x)(x^2 - \alpha_m) + \beta_m$ for some polynomial
$p_1(x)$ and some constants $\alpha_m$, $\beta_m$. If we divide p(x) by $x^2 - \alpha_m$, we can see that
the remainder $\beta_m$ is a constant only if the auxiliary polynomial

$$q(x) = a_{2m+1}x^m + a_{2m-1}x^{m-1} + \cdots + a_1, \tag{22}$$

formed from every odd-numbered coefficient of p(x), is a multiple of $x - \alpha_m$.
Conversely, if q(x) has $x - \alpha_m$ as a factor, then $p(x) = p_1(x)(x^2 - \alpha_m) + \beta_m$,
for some constant $\beta_m$ that may be determined by division.

Similarly, we want $p_1(x)$ to have the form $p_2(x)(x^2 - \alpha_{m-1}) + \beta_{m-1}$, and
this is the same as saying that $q(x)/(x - \alpha_m)$ is a multiple of $x - \alpha_{m-1}$; for if
$q_1(x)$ is the polynomial corresponding to $p_1(x)$ as q(x) corresponds to p(x), we
have $q_1(x) = q(x)/(x - \alpha_m)$. Continuing in the same way, we find that the
parameters $\alpha_1, \beta_1, \ldots, \alpha_m, \beta_m$ will exist if and only if

$$q(x) = a_{2m+1}(x - \alpha_1)\ldots(x - \alpha_m). \tag{23}$$

In other words, either q(x) is identically zero (and this can happen only when n
is even), or else q(x) is an mth degree polynomial having all real roots.

Now we have a surprising fact discovered by J. Eve [Numer. Math. 6 (1964),
17-21]: If p(x) has at least $n - 1$ complex roots whose real parts are all nonnegative,
or all nonpositive, then the corresponding polynomial q(x) is identically zero
or has all real roots. (See exercise 23.) Since $u(x) = 0$ if and only if $p(x + c) = 0$,
we need merely choose the parameter c large enough that at least $n - 1$ of the
roots of $u(x) = 0$ have a real part $\ge -c$, and (20) will apply whenever $a_{n-1} =
u_{n-1} - ncu_n \ne 0$.

We can also determine c so that these conditions are fulfilled and also that
$\beta_m = 0$. First the n roots of $u(x) = 0$ are determined. If $a + bi$ is a root
having the largest or the smallest real part, and if $b \ne 0$, let $c = -a$ and
$\alpha_m = -b^2$; then $x^2 - \alpha_m$ is a factor of $u(x - c)$. If the root with smallest or
largest real part is real, but the root with second smallest (or second largest) real
part is nonreal, the same transformation applies. If the two roots with smallest
(or largest) real parts are both real, they can be expressed in the form $a - b$
and $a + b$, respectively; let $c = -a$ and $\alpha_m = b^2$. Again $x^2 - \alpha_m$ is a factor
of $u(x - c)$. (Still other values of c are often possible; see exercise 24.) The
coefficient $a_{n-1}$ will be nonzero for at least one of these alternatives, unless q(x)
is identically zero. ∎

Note that this method of proof usually gives at least two values of c, and
we also have the chance to permute $\alpha_1, \ldots, \alpha_{m-1}$ in $(m - 1)!$ ways. Some of
these alternatives may give more desirable numerical accuracy than others.

*Polynomial chains. Now let us consider questions of optimality. What are the 
best possible schemes for evaluating polynomials of various degrees, in terms 
of the minimum possible number of arithmetic operations? This question was 
first analyzed by A. M. Ostrowski in the case that no preliminary adaptation 
of coefficients is allowed [Studies in Mathematics and Mechanics presented to 
R. von Mises (New York: Academic Press, 1954), 40-48], and by T. S. Motzkin 
in the case of adapted coefficients [cf. Bull. Amer. Math. Soc. 61 (1955), 163].

In order to investigate this question, we can extend Section 4.6.3’s concept
of addition chains to the notion of polynomial chains. A polynomial chain is a
sequence of the form

$$x = \lambda_0, \ \lambda_1, \ \ldots, \ \lambda_r = u(x), \tag{24}$$

where u(x) is some polynomial in x, and for $1 \le i \le r$

$$\text{either} \quad \lambda_i = (\pm\lambda_j) \circ \lambda_k, \quad 0 \le j, k < i,$$
$$\text{or} \quad \lambda_i = \alpha_j \circ \lambda_k, \quad 0 \le k < i. \tag{25}$$

Here “$\circ$” denotes any of the three operations “+”, “−”, or “×”, and $\alpha_j$ denotes
a so-called parameter. Steps of the first kind are called chain steps, and steps
of the second kind are called parameter steps. We shall assume that a different
parameter $\alpha_j$ is used in each parameter step; if there are s parameter steps, they
should involve $\alpha_1, \alpha_2, \ldots, \alpha_s$ in this order.

It follows that the polynomial u(x) at the end of the chain has the form 


$$u(x) = q_n x^n + \cdots + q_1 x + q_0, \tag{26}$$

where $q_n$, ..., $q_1$, $q_0$ are polynomials in $\alpha_1$, $\alpha_2$, ..., $\alpha_s$ with integer coefficients.
We shall interpret the parameters $\alpha_1$, $\alpha_2$, ..., $\alpha_s$ as real numbers, and we shall
therefore restrict ourselves to considering the evaluation of polynomials with
real coefficients. The result set $R$ of a polynomial chain is defined to be the
set of all vectors $(q_n, \ldots, q_1, q_0)$ of real numbers that occur as $\alpha_1$, $\alpha_2$, ..., $\alpha_s$
independently assume all possible real values.

If for every choice of $t + 1$ distinct integers $j_0, \ldots, j_t \in \{0, 1, \ldots, n\}$ there
is a nonzero multivariate polynomial $f_{j_0 \ldots j_t}$ with integer coefficients such that
$f_{j_0 \ldots j_t}(q_{j_0}, \ldots, q_{j_t}) = 0$ for all $(q_n, \ldots, q_1, q_0)$ in $R$, let us say that the result
set $R$ has at most $t$ degrees of freedom, and that the chain (24) has at most $t$
degrees of freedom. We also say that the chain (24) computes a given polynomial
$u(x) = u_n x^n + \cdots + u_1 x + u_0$ if $(u_n, \ldots, u_1, u_0)$ is in $R$. It follows that a
polynomial chain with at most n degrees of freedom cannot compute all nth 
degree polynomials (see exercise 27). 

As an example of a polynomial chain, consider the following chain
corresponding to Theorem E, when $n$ is odd:

$$\begin{aligned}
\lambda_0 &= x\\
\lambda_1 &= \alpha_1 + \lambda_0\\
\lambda_2 &= \lambda_1 \times \lambda_1\\
\lambda_3 &= \alpha_2 \times \lambda_1
\end{aligned}$$

$$\left.\begin{aligned}
\lambda_{1+3i} &= \alpha_{1+2i} + \lambda_{3i}\\
\lambda_{2+3i} &= \alpha_{2+2i} + \lambda_2\\
\lambda_{3+3i} &= \lambda_{1+3i} \times \lambda_{2+3i}
\end{aligned}\right\}\qquad 1 \le i < n/2. \tag{27}$$

There are $\lfloor n/2 \rfloor + 2$ multiplications and $n$ additions; $\lfloor n/2 \rfloor + 1$ chain steps and
$n + 1$ parameter steps. By Theorem E, the result set $R$ includes the set of all
$(u_n, \ldots, u_1, u_0)$ with $u_n \ne 0$, so (27) computes all polynomials of degree $n$. We
cannot prove that $R$ has at most $n$ degrees of freedom, since the result set has
$n + 1$ independent components.

A polynomial chain with $s$ parameter steps has at most $s$ degrees of freedom.
In a sense, this is obvious: we can’t compute a function with $t$ degrees of freedom
using fewer than $t$ arbitrary parameters. But this intuitive fact is not easy
to prove formally; for example, there are continuous functions (“space-filling
curves”) that map the real line onto a plane, and such functions map a single
parameter into two independent parameters. For our purposes, we need to verify
that no polynomial functions with integer coefficients can have such a property;
a proof appears in exercise 28.

Given this fact, we can proceed to prove the results we seek:

Theorem M (T. S. Motzkin, 1954). A polynomial chain with $m > 0$ multiplications
has at most $2m$ degrees of freedom.

Proof. Let $\mu_1$, $\mu_2$, ..., $\mu_m$ be the $\lambda_i$’s of the chain that are multiplication
operations. Then

$$\mu_i = S_{2i-1} \times S_{2i}, \quad 1 \le i \le m, \qquad u(x) = S_{2m+1}, \tag{28}$$

where each $S_j$ is a certain sum of $\mu$’s, $x$’s, and $\alpha$’s. Write $S_j = T_j + \beta_j$, where
$T_j$ is a sum of $\mu$’s and $x$’s while $\beta_j$ is a sum of $\alpha$’s.

Now $u(x)$ is expressible as a polynomial in $x$, $\beta_1$, ..., $\beta_{2m+1}$ with integer
coefficients. Since the $\beta$’s are expressible as linear functions of $\alpha_1$, ..., $\alpha_s$, the
set of values represented by all real values of $\beta_1$, ..., $\beta_{2m+1}$ contains the result
set of the chain. Therefore there are at most $2m + 1$ degrees of freedom; this
can be improved to $2m$ when $m > 0$, as shown in exercise 30. ∎

An example of the construction in the proof of Theorem M appears in
exercise 25. A similar result can be proved for additions:

Theorem A (E. G. Belaga, 1958). A polynomial chain containing $q$ additions
and subtractions has at most $q + 1$ degrees of freedom.

Proof. [Problemi Kibernetiki 5 (1961), 7-15.] Let $\kappa_1$, ..., $\kappa_q$ be the $\lambda_i$’s of the
chain that correspond to addition or subtraction operations. Then

$$\kappa_i = \pm T_{2i-1} \pm T_{2i}, \quad 1 \le i \le q, \qquad u(x) = T_{2q+1}, \tag{29}$$

where each $T_j$ is a product of $\kappa$’s, $x$’s, and $\alpha$’s. We may write $T_j = A_j B_j$,
where $A_j$ is a product of $\alpha$’s and $B_j$ is a product of $\kappa$’s and $x$’s. The following
transformation may now be made to the chain, successively for $i = 1, 2, \ldots, q$:
Let $\beta_i = A_{2i}/A_{2i-1}$, so that $\kappa_i = A_{2i-1}(\pm B_{2i-1} \pm \beta_i B_{2i})$. Then change $\kappa_i$ to
$\pm B_{2i-1} \pm \beta_i B_{2i}$, and replace each occurrence of $\kappa_i$ in future formulas $T_{2i+1}$,
$T_{2i+2}$, ..., $T_{2q+1}$ by $A_{2i-1}\kappa_i$. (This replacement may change the values of
$A_{2i+1}$, $A_{2i+2}$, ..., $A_{2q+1}$.)

After the above transformation has been done for all $i$, let $\beta_{q+1} = A_{2q+1}$;
then $u(x)$ can be expressed as a polynomial in $\beta_1$, ..., $\beta_{q+1}$, and $x$, with integer
coefficients. We are almost ready to complete the proof, but we must be careful
because the polynomials obtained as $\beta_1$, ..., $\beta_{q+1}$ range over all real values may
not include all polynomials representable by the original chain (see exercise 26);
it is possible to have $A_{2i-1} = 0$, for some values of the $\alpha$’s, and this makes $\beta_i$
undefined.

To complete the proof, let us observe that the result set $R$ of the original
chain can be written $R = R_1 \cup R_2 \cup \cdots \cup R_q \cup R'$, where $R_i$ is the set of
result vectors possible when $A_{2i-1} = 0$, and where $R'$ is the set of result vectors
possible when all $\alpha$’s are nonzero. The discussion above proves that $R'$ has at
most $q + 1$ degrees of freedom. If $A_{2i-1} = 0$, then $T_{2i-1} = 0$, so addition
step $\kappa_i$ may be dropped to obtain another chain computing the result set $R_i$;
by induction we see that each $R_i$ has at most $q$ degrees of freedom. Hence by
exercise 29, $R$ has at most $q + 1$ degrees of freedom. ∎

Theorem C. If a polynomial chain (24) computes all $n$th degree polynomials
$u(x) = u_n x^n + \cdots + u_0$, for some $n \ge 2$, then it includes at least $\lfloor n/2 \rfloor + 1$
multiplications and at least $n$ addition-subtractions.

Proof. Let there be $m$ multiplication steps. By Theorem M, the chain has at
most $2m$ degrees of freedom, so $2m \ge n + 1$. Similarly, by Theorem A there
are $\ge n$ addition-subtractions. ∎

This theorem states that no single method having fewer than $\lfloor n/2 \rfloor + 1$
multiplications or fewer than $n$ additions can evaluate all possible $n$th degree
polynomials. The result of exercise 29 allows us to strengthen this and say that
no finite collection of such polynomial chains will suffice for all polynomials of
a given degree. Some special polynomials can, of course, be evaluated more
efficiently; all we have really proved is that polynomials whose coefficients are
algebraically independent, in the sense that they satisfy no nontrivial polynomial
equation, require $\lfloor n/2 \rfloor + 1$ multiplications and $n$ additions. Unfortunately the
coefficients we deal with in computers are always rational numbers, so the above
theorems don’t really apply; in fact, exercise 42 shows that we can always get by
with $O(\sqrt{n})$ multiplications (and a possibly huge number of additions). From a
practical standpoint, the bounds of Theorem C apply to “almost all” coefficients,
and they seem to apply to all reasonable schemes for evaluation. Furthermore
it is possible to obtain lower bounds corresponding to those of Theorem C even
in the rational case: By strengthening the above proofs, V. Strassen has shown,
for example, that the polynomial

$$u(x) = \sum_{0 \le k \le n} 2^{2^{kn^3}} x^k \tag{30}$$

cannot be evaluated by any polynomial chain of length $< n^2/\lg n$ unless the
chain has at least $\frac{1}{2}n - 2$ multiplications and $n - 4$ additions [SIAM J.
Computing 3 (1974), 128-149]. The coefficients of (30) are very large; but it is
also possible to find polynomials whose coefficients are just 0’s and 1’s, such
that every polynomial chain computing them involves at least $\sqrt{n}/(4 \lg n)$ chain
multiplications, for all sufficiently large $n$, even when the parameters $\alpha_j$ are
allowed to be arbitrary complex numbers. [See R. J. Lipton, SIAM J. Computing
7 (1978), 61-69; C.-P. Schnorr, Lecture Notes in Comp. Sci. 53 (1977), 135-147.]
Jean-Paul Van de Wiele has shown that the evaluation of certain 0-1 polynomials
requires a total of at least $cn/\log n$ arithmetic operations, for some $c > 0$ [Proc.
IEEE Symp. Foundations of Comp. Sci. 19 (1978), 159-165].

A gap still remains between the lower bounds of Theorem C and the actual
operation counts known to be achievable, except in the trivial case $n = 2$.
Theorem E gives $\lfloor n/2 \rfloor + 2$ multiplications, not $\lfloor n/2 \rfloor + 1$, although it does
achieve the minimum number of additions. Our special methods for $n = 4$ and
$n = 6$ have the minimum number of multiplications, but one extra addition.
When $n$ is odd, it is not difficult to prove that the lower bounds of Theorem C
cannot be achieved simultaneously for both multiplications and additions; see
exercise 33. For $n = 3$, 5, and 7, it is possible to show that at least $\lfloor n/2 \rfloor + 2$
multiplications are necessary. Exercises 35 and 36 show that the lower bounds
of Theorem C cannot both be achieved when $n = 4$ or $n = 6$; thus the methods
we have discussed are best possible, for $n < 8$. When $n$ is even, Motzkin proved
that $\lfloor n/2 \rfloor + 1$ multiplications are sufficient, but his construction involves an
indeterminate number of additions (see exercise 39). An optimal scheme for
$n = 8$ was found by V. Ya. Pan, who showed that $n + 1$ additions are necessary
and sufficient for this case when there are $\lfloor n/2 \rfloor + 1$ multiplications; he also
showed that $\lfloor n/2 \rfloor + 1$ multiplications and $n + 2$ additions will suffice for all even
$n \ge 10$. Pan’s paper [Proc. ACM Symp. Theory of Computing 10 (1978),
162-172] also establishes the exact minimum number of multiplications and additions
needed when calculations are done entirely with complex numbers instead of
reals, for all degrees $n$. Exercise 40 discusses the interesting situation that arises
for odd values of $n \ge 9$.

It is clear that the results we have obtained about chains for polynomials in 
a single variable can be extended without difficulty to multivariate polynomials. 
For example, if we want to find an optimum scheme for polynomial evaluation 
without adaptation of coefficients, we can regard u(x) as a polynomial in the 
$n + 2$ variables $x$, $u_n$, ..., $u_1$, $u_0$; exercise 38 shows that $n$ multiplications and
$n$ additions are necessary in this case. Indeed, A. Borodin [Theory of Machines
and Computations, ed. by Z. Kohavi and A. Paz (New York: Academic Press,
1971), 45-58] has proved that Horner’s rule (2) is essentially the only way to
compute $u(x)$ in $2n$ operations without preconditioning.

With minor variations, the above methods can be extended to chains
involving division, i.e., to rational functions as well as polynomials. Curiously, the
continued-fraction analog of Horner’s rule now turns out to be optimal from an
operation-count standpoint, if multiplication and division speeds are equal, even 
when preconditioning is allowed (see exercise 37). 

Sometimes division is helpful during the evaluation of polynomials, even 
though polynomials are defined only in terms of multiplication and addition; 
we have seen examples of this in the Shaw-Traub algorithms for polynomial 
derivatives. Another example is the polynomial

$$x^n + \cdots + x + 1;$$

since this polynomial can be written $(x^{n+1} - 1)/(x - 1)$, we can evaluate it with
$l(n + 1)$ multiplications (see Section 4.6.3), two subtractions, and one division,
while techniques that avoid division seem to require about three times as many
operations (see exercise 43).

Special multivariate polynomials. The determinant of an $n \times n$ matrix may be
considered to be a polynomial in $n^2$ variables $x_{ij}$, $1 \le i, j \le n$. If $x_{11} \ne 0$, we
have

$$\det\begin{pmatrix}
x_{11} & x_{12} & \ldots & x_{1n}\\
x_{21} & x_{22} & \ldots & x_{2n}\\
x_{31} & x_{32} & \ldots & x_{3n}\\
\vdots & \vdots & & \vdots\\
x_{n1} & x_{n2} & \ldots & x_{nn}
\end{pmatrix}
= x_{11}\det\begin{pmatrix}
x_{22} - (x_{21}/x_{11})x_{12} & \ldots & x_{2n} - (x_{21}/x_{11})x_{1n}\\
x_{32} - (x_{31}/x_{11})x_{12} & \ldots & x_{3n} - (x_{31}/x_{11})x_{1n}\\
\vdots & & \vdots\\
x_{n2} - (x_{n1}/x_{11})x_{12} & \ldots & x_{nn} - (x_{n1}/x_{11})x_{1n}
\end{pmatrix}. \tag{31}$$

The determinant of an $n \times n$ matrix may therefore be evaluated by evaluating
the determinant of an $(n-1) \times (n-1)$ matrix and performing an additional
$(n-1)^2 + 1$ multiplications, $(n-1)^2$ additions, and $n - 1$ divisions. Since a
$2 \times 2$ determinant can be evaluated with two multiplications and one addition,
we see that the determinant of almost all matrices (namely those for which no
division by zero is needed) can be computed with at most $(2n^3 - 3n^2 + 7n - 6)/6$
multiplications, $(2n^3 - 3n^2 + n)/6$ additions, and $(n^2 - n - 2)/2$ divisions.

When zero occurs, the determinant is even easier to compute. For example,
if $x_{11} = 0$ but $x_{21} \ne 0$, we have

$$\det\begin{pmatrix}
0 & x_{12} & \ldots & x_{1n}\\
x_{21} & x_{22} & \ldots & x_{2n}\\
x_{31} & x_{32} & \ldots & x_{3n}\\
\vdots & \vdots & & \vdots\\
x_{n1} & x_{n2} & \ldots & x_{nn}
\end{pmatrix}
= -x_{21}\det\begin{pmatrix}
x_{12} & \ldots & x_{1n}\\
x_{32} - (x_{31}/x_{21})x_{22} & \ldots & x_{3n} - (x_{31}/x_{21})x_{2n}\\
\vdots & & \vdots\\
x_{n2} - (x_{n1}/x_{21})x_{22} & \ldots & x_{nn} - (x_{n1}/x_{21})x_{2n}
\end{pmatrix}. \tag{32}$$

Here the reduction to an $(n-1) \times (n-1)$ determinant saves $n - 1$ of the
multiplications and $n - 1$ of the additions used in (31), and this certainly compensates
for the additional bookkeeping required to recognize this case. Therefore any
determinant can be evaluated with roughly $\frac{2}{3}n^3$ arithmetic operations (including
division); this is remarkable, since it is a polynomial with $n!$ terms and $n$ variables
in each term.

If we want to evaluate the determinant of a matrix with integer elements, 
the above process appears to be unattractive since it requires rational arithmetic. 
However, we can use the method to evaluate the determinant mod p, for any 
prime p, since division mod p is possible (exercise 4.5.2-15). If this is done for 
sufficiently many primes p, the exact value of the determinant can be found as 
explained in Section 4.3.2, since Hadamard’s inequality (4.6.1-25) gives an upper 
bound on the magnitude. 
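
The elimination idea of (31) and (32), carried out mod $p$, can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `det_mod_p` is made up for this example, the pivot is brought into place by a row interchange (which flips the sign) rather than by the exact row arrangement shown in (32), and `pow(a, -1, p)` is used for division mod $p$, which needs Python 3.8 or later.

```python
def det_mod_p(matrix, p):
    """Determinant of a square integer matrix modulo a prime p, by elimination."""
    a = [[entry % p for entry in row] for row in matrix]
    n = len(a)
    det = 1
    for col in range(n):
        # Find a row at or below the diagonal with a nonzero entry in this column.
        pivot = next((r for r in range(col, n) if a[r][col]), None)
        if pivot is None:
            return 0                      # the whole column vanishes mod p
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det                    # a row interchange flips the sign
        det = det * a[col][col] % p
        inv = pow(a[col][col], -1, p)     # division mod p (assumes p is prime)
        for r in range(col + 1, n):
            factor = a[r][col] * inv % p
            if factor:
                for c in range(col + 1, n):
                    a[r][c] = (a[r][c] - factor * a[col][c]) % p
    return det % p
```

For instance, `det_mod_p([[0, 2], [3, 4]], 7)` exercises the zero-pivot case; the exact determinant is -6, which is 1 modulo 7. Repeating the computation for several primes and combining the residues is the step the text delegates to Section 4.3.2.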

The coefficients of the characteristic polynomial $\det(xI - X)$ of an $n \times n$
matrix $X$ can also be computed in $O(n^3)$ steps; cf. J. H. Wilkinson, The Algebraic
Eigenvalue Problem (Oxford: Clarendon Press, 1965), 353-355, 410-411. 

The permanent of a matrix is a polynomial that is very similar to the 
determinant; the only difference is that all of its nonzero coefficients are +1. 
Thus we have 

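$$\operatorname{per}\begin{pmatrix}
x_{11} & \ldots & x_{1n}\\
\vdots & & \vdots\\
x_{n1} & \ldots & x_{nn}
\end{pmatrix} = \sum x_{1j_1} x_{2j_2} \ldots x_{nj_n}, \tag{33}$$

summed over all permutations $j_1 j_2 \ldots j_n$ of $\{1, 2, \ldots, n\}$. It would seem that this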
function should be even easier to compute than its more complicated-looking 
cousin, but no way to evaluate the permanent as efficiently as the determinant 
is known. Exercises 9 and 10 show that substantially fewer than $n!$ operations
will suffice, for large $n$, but the execution time of all known methods still grows
exponentially with the size of the matrix. In fact, Leslie G. Valiant has shown
that it is as difficult to compute the permanent of a given 0-1 matrix as it is to
count the number of accepting computations of a nondeterministic
polynomial-time Turing machine, if we ignore polynomial factors in the running time of the
calculation. Therefore a polynomial-time evaluation algorithm for permanents
would imply that scores of other well known problems that have resisted efficient
solution would be solvable in polynomial time. On the other hand, Valiant proved
that the permanent of an $n \times n$ integer matrix can be evaluated modulo $2^k$ in
$O(n^{4k-3})$ steps for all $k \ge 2$. [See Theoretical Comp. Sci. 8 (1979), 189-201.]

Another fundamental operation involving matrices is, of course, matrix
multiplication: If $X = (x_{ij})$ is an $m \times n$ matrix, $Y = (y_{jk})$ is an $n \times s$ matrix,
and $Z = (z_{ik})$ is an $m \times s$ matrix, then the formula $Z = XY$ means that

$$z_{ik} = \sum_{1 \le j \le n} x_{ij} y_{jk}, \qquad 1 \le i \le m, \quad 1 \le k \le s. \tag{34}$$

This equation may be regarded as the computation of $ms$ simultaneous
polynomials in $mn + ns$ variables; each polynomial is the “inner product” of two $n$-place
vectors. A straightforward calculation would involve $mns$ multiplications and
$ms(n - 1)$ additions; but S. Winograd discovered in 1967 that there is a way to
trade about half of the multiplications for additions:


$$\begin{aligned}
z_{ik} &= \sum_{1 \le j \le n/2} (x_{i,2j} + y_{2j-1,k})(x_{i,2j-1} + y_{2j,k}) - a_i - b_k + c_{ik};\\
a_i &= \sum_{1 \le j \le n/2} x_{i,2j}\, x_{i,2j-1}; \qquad
b_k = \sum_{1 \le j \le n/2} y_{2j-1,k}\, y_{2j,k};\\
c_{ik} &= \begin{cases} 0, & n \text{ even};\\ x_{in} y_{nk}, & n \text{ odd}. \end{cases}
\end{aligned} \tag{35}$$

This scheme uses $\lceil n/2 \rceil ms + \lfloor n/2 \rfloor (m + s)$ multiplications and
$(n + 2)ms + (\lfloor n/2 \rfloor - 1)(ms + m + s)$ additions or subtractions; the total number of
operations has increased slightly, but the number of multiplications has roughly been
halved. [See IEEE Trans. C-17 (1968), 693-694.] Winograd’s surprising
construction led many people to look more closely at the problem of matrix
multiplication, and it touched off widespread speculation that $n^3/2$ multiplications
would be necessary to multiply $n \times n$ matrices, because of the somewhat similar
lower bound that was known to hold for polynomials in one variable.
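
A direct transcription of scheme (35) may make the bookkeeping easier to follow. The sketch below is illustrative only: the name `winograd_matmul` and the list-of-rows representation are assumptions made for this example, indices are shifted to start at 0, and nothing is done about the numerical scaling that a floating point implementation would need.

```python
def winograd_matmul(x, y):
    """Multiply an m-by-n matrix x by an n-by-s matrix y using scheme (35)."""
    m, n, s = len(x), len(y), len(y[0])
    half = n // 2
    # a[i] and b[k] each depend on one operand only, so they are computed once.
    a = [sum(x[i][2*j + 1] * x[i][2*j] for j in range(half)) for i in range(m)]
    b = [sum(y[2*j][k] * y[2*j + 1][k] for j in range(half)) for k in range(s)]
    z = [[0] * s for _ in range(m)]
    for i in range(m):
        for k in range(s):
            # One multiplication per pair of columns instead of two.
            total = sum((x[i][2*j + 1] + y[2*j][k]) * (x[i][2*j] + y[2*j + 1][k])
                        for j in range(half))
            total -= a[i] + b[k]
            if n % 2:  # c_ik: the unpaired last term when n is odd
                total += x[i][n - 1] * y[n - 1][k]
            z[i][k] = total
    return z
```

Note that the products inside the double loop multiply an entry of `x` plus an entry of `y` by another such sum, so the identity relies on commutativity of the entries.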

An even better scheme for large $n$ was discovered by Volker Strassen in
1968; he found a way to compute the product of $2 \times 2$ matrices with only
seven multiplications, without relying on the commutativity of multiplication as
in (35). Since $2n \times 2n$ matrices can be partitioned into four $n \times n$ matrices,
his idea can be used recursively to obtain the product of $2^k \times 2^k$ matrices with
only $7^k$ multiplications instead of $(2^k)^3 = 8^k$. The number of additions also
grows as order $7^k$. Strassen’s original $2 \times 2$ identity [Numer. Math. 13 (1969),
354-356] used 7 multiplications and 18 additions; S. Winograd later discovered
the following more economical formula:

$$\begin{pmatrix} a & b\\ c & d \end{pmatrix}
\begin{pmatrix} A & C\\ B & D \end{pmatrix}
= \begin{pmatrix}
aA + bB & w + (c+d)(C-A) + (a+b-c-d)D\\
w + (a-c)(D-C) - d(A-B-C+D) & w + (a-c)(D-C) + (c+d)(C-A)
\end{pmatrix},$$

$$\text{where } w = aA - (a-c-d)(A-C+D). \tag{36}$$

If intermediate results are appropriately saved, (36) involves 7 multiplications
and only 15 additions; by induction on $k$, we can multiply $2^k \times 2^k$ matrices with
$7^k$ multiplications and $5(7^k - 4^k)$ additions. The total number of operations
needed to multiply $n \times n$ matrices has therefore been reduced from order $n^3$
to $O(n^{\lg 7}) = O(n^{2.8074})$. A similar reduction applies also to the evaluation of
determinants and matrix inverses; cf. J. R. Bunch and J. E. Hopcroft, Math.
Comp. 28 (1974), 231-236.
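
One way to save the intermediate results of (36) so that only 15 block additions are needed is sketched below. The helper names (`add`, `sub`, `split`, `multiply`) and the particular order of the additions are assumptions made for this illustration, not taken from the text; matrices are lists of rows whose size is assumed to be a power of 2.

```python
def add(p, q):
    return [[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(p, q)]

def sub(p, q):
    return [[u - v for u, v in zip(r1, r2)] for r1, r2 in zip(p, q)]

def split(mat):
    """Return the four quadrants (top-left, top-right, bottom-left, bottom-right)."""
    h = len(mat) // 2
    return ([row[:h] for row in mat[:h]], [row[h:] for row in mat[:h]],
            [row[:h] for row in mat[h:]], [row[h:] for row in mat[h:]])

def multiply(x, y):
    """Product of two 2^k-by-2^k matrices using identity (36) recursively."""
    if len(x) == 1:
        return [[x[0][0] * y[0][0]]]
    a, b, c, d = split(x)
    A, C, B, D = split(y)          # the right factor of (36) is laid out as (A C; B D)
    s1 = add(c, d)                 # c + d
    s2 = sub(s1, a)                # c + d - a
    s3 = sub(a, c)                 # a - c
    s4 = sub(b, s2)                # a + b - c - d
    t1 = sub(C, A)                 # C - A
    t2 = sub(D, t1)                # A - C + D
    t3 = sub(D, C)                 # D - C
    t4 = sub(t2, B)                # A - B - C + D
    p1, p2 = multiply(a, A), multiply(b, B)
    p3, p4 = multiply(s2, t2), multiply(s3, t3)
    p5, p6, p7 = multiply(s1, t1), multiply(s4, D), multiply(d, t4)
    w = add(p1, p3)                # aA - (a - c - d)(A - C + D)
    u = add(w, p4)
    z11 = add(p1, p2)
    z12 = add(add(w, p5), p6)
    z21 = sub(u, p7)
    z22 = add(u, p5)
    return ([r1 + r2 for r1, r2 in zip(z11, z12)] +
            [r1 + r2 for r1, r2 in zip(z21, z22)])
```

Every product has a left-hand block on the left and a right-hand block on the right, which is why the recursion remains valid when the entries are themselves matrices.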

Strassen’s exponent $\lg 7$ resisted numerous attempts at improvement until
1978, when Viktor Pan discovered that it could be lowered to $\log_{70} 143640 \approx
2.795$ (see exercise 60). This new breakthrough led to further intensive analysis
of the problem, and the combined efforts of D. Bini, M. Capovani, G. Lotti, F.
Romani, A. Schönhage, V. Pan, S. Winograd, and D. Coppersmith culminated
in constructions that have an asymptotic running time of $O(n^{2.5161})$. Exercises
60-64 discuss some of the interesting techniques that were used to derive this
bound; a full account of the developments has been prepared by V. Ya. Pan,
Computers and Math., to appear.

These theoretical results are quite striking, but from a practical standpoint 
they are of limited use because n must be very large before we overcome the effect 
of additional bookkeeping costs. Richard Brent [Stanford Computer Science 
report CS157 (March, 1970), see also Numer. Math. 16 (1970), 145-156] found 
that a careful implementation of Winograd’s scheme (35), with appropriate 
scaling for numerical stability, became better than the conventional method only 
when $n > 40$, and it saved only about 7 percent of the running time when
$n = 100$. For complex arithmetic the situation was somewhat different; (35)
became advantageous for $n > 20$, and saved 18 percent when $n = 100$. He
estimated that Strassen’s scheme would not begin to excel over (35) until $n \approx
250$; and such enormous matrices, containing more than 60,000 entries, rarely
occur in practice (unless they are very sparse, when other techniques apply).
Furthermore, the known methods of order $n^\beta$ where $\beta < 2.7$ have such large
constants of proportionality that they require more than $10^{23}$ multiplications
before they start to beat Strassen’s scheme.

By contrast, the methods we shall discuss next are eminently practical and 
have found wide use. The discrete Fourier transform $f$ of a complex-valued
function $F$ of $n$ variables, over respective domains of $m_1$, ..., $m_n$ elements, is
defined by the equation

$$f(s_1, \ldots, s_n) = \sum_{\substack{0 \le t_1 < m_1\\ \cdots\\ 0 \le t_n < m_n}}
\exp\Bigl(2\pi i\Bigl(\frac{s_1 t_1}{m_1} + \cdots + \frac{s_n t_n}{m_n}\Bigr)\Bigr) F(t_1, \ldots, t_n) \tag{37}$$

for $0 \le s_1 < m_1$, ..., $0 \le s_n < m_n$; the name “transform” is justified because
we can recover the values $F(t_1, \ldots, t_n)$ from the values $f(s_1, \ldots, s_n)$, as shown
in exercise 13. In the important special case that all $m_j = 2$, we have

$$f(s_1, \ldots, s_n) = \sum_{0 \le t_1, \ldots, t_n \le 1} (-1)^{s_1 t_1 + \cdots + s_n t_n} F(t_1, \ldots, t_n) \tag{38}$$

for $0 \le s_1, \ldots, s_n \le 1$, and this may be regarded as a simultaneous evaluation
of $2^n$ linear polynomials in $2^n$ variables $F(t_1, \ldots, t_n)$. A well-known technique
due to F. Yates [The Design and Analysis of Factorial Experiments (Harpenden:
Imperial Bureau of Soil Sciences, 1937)] can be used to reduce the number
of additions implied in (38) from $2^n(2^n - 1)$ to $n2^n$. Yates’s method can be
understood by considering the case $n = 3$: Let $x_{t_1 t_2 t_3} = F(t_1, t_2, t_3)$.

Given   First step   Second step            Third step

x000    x000+x001    x000+x001+x010+x011    x000+x001+x010+x011+x100+x101+x110+x111
x001    x010+x011    x100+x101+x110+x111    x000-x001+x010-x011+x100-x101+x110-x111
x010    x100+x101    x000-x001+x010-x011    x000+x001-x010-x011+x100+x101-x110-x111
x011    x110+x111    x100-x101+x110-x111    x000-x001-x010+x011+x100-x101-x110+x111
x100    x000-x001    x000+x001-x010-x011    x000+x001+x010+x011-x100-x101-x110-x111
x101    x010-x011    x100+x101-x110-x111    x000-x001+x010-x011-x100+x101-x110+x111
x110    x100-x101    x000-x001-x010+x011    x000+x001-x010-x011-x100-x101+x110+x111
x111    x110-x111    x100-x101-x110+x111    x000-x001-x010+x011-x100+x101+x110-x111

To get from the “Given” to the “First step” requires four additions and four 
subtractions; and the interesting feature of Yates’s method is that exactly the 
same transformation that takes us from “Given” to “First step” will take us 
from “First step” to “Second step” and from “Second step” to “Third step.” In 
each case we do four additions, then four subtractions; and after three steps we
magically have the desired Fourier transform $f(s_1, s_2, s_3)$ in the place originally
occupied by $F(s_1, s_2, s_3)$.
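
The repeated step is short enough to write out. In the sketch below the names `yates_step` and `walsh_transform` are assumptions of this example; the data are held in a Python list in the order x000, x001, ..., x111 (or its analog for other $n$), and a new list is built at each pass rather than overwriting in place.

```python
def yates_step(values):
    """One pass: sums of adjacent pairs, followed by their differences."""
    pairs = list(zip(values[0::2], values[1::2]))
    return [u + v for u, v in pairs] + [u - v for u, v in pairs]

def walsh_transform(values):
    """Apply yates_step n times to a list of 2**n numbers."""
    n = len(values).bit_length() - 1
    if len(values) != 1 << n:
        raise ValueError("length must be a power of 2")
    for _ in range(n):
        values = yates_step(values)
    return values
```

Each pass costs $2^{n-1}$ additions and $2^{n-1}$ subtractions, so the $n$ passes account for the $n2^n$ operations mentioned above.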

This special case is often called the Walsh transform of 2 n data elements, 
since the corresponding pattern of signs was studied by J. L. Walsh [Amer. J. 
Math. 45 (1923), 5-24]. Note that the number of sign changes from left to right 
in the “Third step” above assumes the respective values 

0, 7, 3, 4, 1, 6, 2, 5; 

this is a permutation of the numbers $\{0, 1, 2, 3, 4, 5, 6, 7\}$. Walsh observed that
there will be exactly 0, 1, ..., $2^n - 1$ sign changes in the general case, if
we permute the transformed elements appropriately, so the coefficients provide 
discrete approximations to sine waves with various frequencies. (See H. F. 
Harmuth, IEEE Spectrum 6, 11 (Nov. 1969), 82-91, for applications of this 
property; and see Section 7.2.1 for further discussion of the Walsh coefficients.) 

Yates’s method can be generalized to the evaluation of any discrete Fourier 
transform, and, in fact, to the evaluation of any set of sums that can be written 
in the general form 

$$f(s_1, s_2, \ldots, s_n) = \sum_{\substack{0 \le t_1 < m_1\\ \cdots\\ 0 \le t_n < m_n}}
g_1(s_1, s_2, \ldots, s_n, t_1)\, g_2(s_2, \ldots, s_n, t_2) \ldots g_n(s_n, t_n)\, F(t_1, t_2, \ldots, t_n) \tag{39}$$

for $0 \le s_j < m_j$, given the functions $g_j(s_j, \ldots, s_n, t_j)$. We proceed as follows.

$$\begin{aligned}
f^{[0]}(t_1, t_2, t_3, \ldots, t_n) &= F(t_1, t_2, t_3, \ldots, t_n);\\
f^{[1]}(s_n, t_1, t_2, \ldots, t_{n-1}) &= \sum_{0 \le t_n < m_n} g_n(s_n, t_n)\, f^{[0]}(t_1, t_2, \ldots, t_n);\\
f^{[2]}(s_{n-1}, s_n, t_1, \ldots, t_{n-2}) &= \sum_{0 \le t_{n-1} < m_{n-1}} g_{n-1}(s_{n-1}, s_n, t_{n-1})\, f^{[1]}(s_n, t_1, \ldots, t_{n-1});\\
&\;\;\vdots\\
f^{[n]}(s_1, s_2, s_3, \ldots, s_n) &= \sum_{0 \le t_1 < m_1} g_1(s_1, \ldots, s_n, t_1)\, f^{[n-1]}(s_2, s_3, \ldots, s_n, t_1);\\
f(s_1, s_2, s_3, \ldots, s_n) &= f^{[n]}(s_1, s_2, s_3, \ldots, s_n).
\end{aligned} \tag{40}$$

For Yates’s method as shown above, $g_j(s_j, \ldots, s_n, t_j) = (-1)^{s_j t_j}$; $f^{[0]}(t_1, t_2, t_3)$
represents the “Given”; $f^{[1]}(s_3, t_1, t_2)$ represents the “First step”; etc. Whenever
a desired set of sums can be put into the form of (39), for reasonably simple
functions $g_j(s_j, \ldots, s_n, t_j)$, the scheme (40) will reduce the amount of
computation from order $N^2$ to order $N \log N$ or thereabouts, where $N = m_1 \ldots m_n$ is
the number of data points; furthermore this scheme is ideally suited to parallel 
computation. The important special case of one-dimensional Fourier transforms 
is discussed in exercises 14 and 53; we have considered the one-dimensional case 
also in Section 4.3.3. 

Let us consider one more special case of polynomial evaluation. Lagrange’s 
interpolation polynomial of order $n$, which we shall write as

$$\begin{aligned}
u_{[n]}(x) ={}& y_0 \frac{(x - x_1)(x - x_2) \ldots (x - x_n)}{(x_0 - x_1)(x_0 - x_2) \ldots (x_0 - x_n)}
+ y_1 \frac{(x - x_0)(x - x_2) \ldots (x - x_n)}{(x_1 - x_0)(x_1 - x_2) \ldots (x_1 - x_n)}\\
&+ \cdots + y_n \frac{(x - x_0)(x - x_1) \ldots (x - x_{n-1})}{(x_n - x_0)(x_n - x_1) \ldots (x_n - x_{n-1})},
\end{aligned} \tag{41}$$

is the only polynomial of degree $\le n$ in $x$ that takes on the respective values
$y_0$, $y_1$, ..., $y_n$ at the $n + 1$ distinct points $x = x_0$, $x_1$, ..., $x_n$. (For it is
evident from (41) that $u_{[n]}(x_k) = y_k$ for $0 \le k \le n$. If $f(x)$ is any such
polynomial of degree $\le n$, then $g(x) = f(x) - u_{[n]}(x)$ is of degree $\le n$, and $g(x)$
is zero for $x = x_0$, $x_1$, ..., $x_n$; therefore $g(x)$ is a multiple of the polynomial
$(x - x_0)(x - x_1) \ldots (x - x_n)$. The degree of the latter polynomial is greater
than $n$, so $g(x) = 0$.) If we assume that the values of a function in some table
are well approximated by a polynomial, Lagrange’s formula (41) may therefore
be used to “interpolate” for values of the function at points $x$ not appearing in
the table. Unfortunately, there seem to be quite a few additions, subtractions,
multiplications, and divisions in Lagrange’s formula; in fact, there are exactly
$n$ additions, $2n^2 + 2n$ subtractions, $2n^2 + n - 1$ multiplications, and $n + 1$ divisions.
But fortunately (as we might be conditioned to suspect by now), improvement
is possible.

The basic idea for simplifying (41) is to note that $u_{[n]}(x) - u_{[n-1]}(x)$ is zero
for $x = x_0$, ..., $x_{n-1}$; thus $u_{[n]}(x) - u_{[n-1]}(x)$ is a polynomial of degree $n$
or less, and a multiple of $(x - x_0) \ldots (x - x_{n-1})$. We conclude that $u_{[n]}(x) =
\alpha_n (x - x_0) \ldots (x - x_{n-1}) + u_{[n-1]}(x)$, where $\alpha_n$ is a constant. This leads us to
Newton’s interpolation formula

$$u_{[n]}(x) = \alpha_n (x - x_0)(x - x_1) \ldots (x - x_{n-1}) + \cdots
+ \alpha_2 (x - x_0)(x - x_1) + \alpha_1 (x - x_0) + \alpha_0, \tag{42}$$

where the $\alpha$’s are some constants we should like to determine from $x_0$, $x_1$,
..., $x_n$, $y_0$, $y_1$, ..., $y_n$. Note that this formula holds for all $n$; the coefficient
$\alpha_k$ does not depend on $x_{k+1}$, ..., $x_n$, or $y_{k+1}$, ..., $y_n$. Once the $\alpha$’s are
known, Newton’s interpolation formula is convenient for calculation, since we
may generalize Horner’s rule once again and write

$$u_{[n]}(x) = \bigl(( \ldots (\alpha_n (x - x_{n-1}) + \alpha_{n-1})(x - x_{n-2}) + \cdots )(x - x_0) + \alpha_0\bigr). \tag{43}$$

This requires $n$ multiplications and $2n$ additions. Alternatively, we may evaluate
each of the individual terms of (42) from right to left; with $2n - 1$ multiplications
and $2n$ additions we thereby calculate all of the values $u_{[0]}(x)$, $u_{[1]}(x)$, ..., $u_{[n]}(x)$,
and this indicates whether or not an interpolation process is “converging.”

The coefficients ak in Newton’s formula may be found by computing the 
divided differences in the following tableau (shown for n = 3): 


$$\begin{array}{llll}
y_0\\
& (y_1 - y_0)/(x_1 - x_0) = y'_1\\
y_1 & & (y'_2 - y'_1)/(x_2 - x_0) = y''_2\\
& (y_2 - y_1)/(x_2 - x_1) = y'_2 & & (y''_3 - y''_2)/(x_3 - x_0) = y'''_3\\
y_2 & & (y'_3 - y'_2)/(x_3 - x_1) = y''_3\\
& (y_3 - y_2)/(x_3 - x_2) = y'_3\\
y_3
\end{array} \tag{44}$$

It is possible to prove that $\alpha_0 = y_0$, $\alpha_1 = y'_1$, $\alpha_2 = y''_2$, etc., and to show
that the divided differences have important relations to the derivatives of the
function being interpolated; see exercise 15. Therefore the following calculation
(corresponding to (44)) may be used to obtain the $\alpha$’s:

Start with $(\alpha_0, \alpha_1, \ldots, \alpha_n) \leftarrow (y_0, y_1, \ldots, y_n)$; then, for $k = 1, 2, \ldots, n$
(in this order), set $\alpha_j \leftarrow (\alpha_j - \alpha_{j-1})/(x_j - x_{j-k})$ for $j = n, n - 1,
\ldots, k$ (in this order).

This process requires $\frac{1}{2}(n^2 + n)$ divisions and $n^2 + n$ subtractions, so about
three-fourths of the work implied in (41) has been saved.

For example, suppose that we want to estimate $\frac{3}{2}!$ from the values of 0!, 1!,
2!, and 3!, using a cubic polynomial. The divided differences are

x    y    y'    y''    y'''

0    1
          0
1    1          1/2
          1            1/3
2    2          3/2
          4
3    6

so $u_{[0]}(x) = u_{[1]}(x) = 1$, $u_{[2]}(x) = \frac{1}{2}x(x - 1) + 1$, $u_{[3]}(x) = \frac{1}{3}x(x - 1)(x - 2) +
\frac{1}{2}x(x - 1) + 1$. Setting $x = \frac{3}{2}$ in the latter polynomial gives $-\frac{1}{8} + \frac{3}{8} + 1 = 1.25$;
presumably the “correct” value is $\Gamma(\frac{3}{2} + 1) = \frac{3}{4}\sqrt{\pi} \approx 1.33$.
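
The in-place calculation of the $\alpha$’s and the nested evaluation (43) can be checked against this example with exact rational arithmetic. The function names `newton_coefficients` and `newton_eval` below are assumptions of this sketch, as is the use of Python’s `fractions.Fraction` to avoid rounding.

```python
from fractions import Fraction

def newton_coefficients(xs, ys):
    """Return alpha_0..alpha_n of (42), overwriting the y's step by step."""
    alpha = [Fraction(y) for y in ys]
    n = len(xs) - 1
    for k in range(1, n + 1):
        # j runs downward so that alpha[j-1] still holds the value from pass k-1.
        for j in range(n, k - 1, -1):
            alpha[j] = (alpha[j] - alpha[j - 1]) / (xs[j] - xs[j - k])
    return alpha

def newton_eval(xs, alpha, x):
    """Evaluate (42) by the nested form (43)."""
    result = alpha[-1]
    for k in range(len(alpha) - 2, -1, -1):
        result = result * (x - xs[k]) + alpha[k]
    return result

xs, ys = [0, 1, 2, 3], [1, 1, 2, 6]          # the values 0!, 1!, 2!, 3!
alpha = newton_coefficients(xs, ys)           # 1, 0, 1/2, 1/3
estimate = newton_eval(xs, alpha, Fraction(3, 2))   # 5/4
```

The inner loop performs one subtraction in the numerator, one in the denominator, and one division, which matches the operation counts quoted above.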

An important and somewhat surprising application of polynomial
interpolation was discovered by Adi Shamir [CACM 22 (1979), 612-613], who observed
that polynomials mod $p$ can be used to “share a secret.” This means that we
can design a system of secret keys or passwords such that the knowledge of any
$n + 1$ of the keys enables efficient calculation of a magic number $N$ that
unlocks a door (say), but the knowledge of any $n$ of the keys gives no information
whatsoever about $N$. Shamir’s amazingly simple solution to this problem is to
choose a random polynomial $u(x) = u_n x^n + \cdots + u_1 x + u_0$, where $0 \le u_i < p$
and $p$ is a large prime number. Each part of the secret is an integer $x$ in the
range $0 < x < p$, together with the value of $u(x) \bmod p$; and the supersecret
number $N$ is the constant term $u_0$. Given $n + 1$ values $u(x_i)$, we can deduce
$N$ by interpolation. But if only $n$ values of $u(x_i)$ are given, there is a unique
polynomial $u(x)$ having a given constant term but the same values at $x_1$, ..., $x_n$;
thus the $n$ values do not make one particular $N$ more likely than any other.
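
A small sketch of this scheme follows. It is an illustration under stated assumptions rather than a vetted implementation: the names `make_shares` and `recover` are invented here, the share points are simply $x = 1, 2, \ldots$ (any distinct nonzero residues would do, so `count` must be less than $p$), the random coefficients come from Python’s `secrets` module, and recovery uses Lagrange’s formula (41) evaluated at $x = 0$ with division mod $p$ done by `pow(d, -1, p)` (Python 3.8 or later).

```python
import secrets

def make_shares(secret, n, count, p):
    """Split `secret` so that any n + 1 of the `count` shares recover it."""
    coeffs = [secret % p] + [secrets.randbelow(p) for _ in range(n)]  # u_0..u_n
    shares = []
    for x in range(1, count + 1):          # distinct nonzero points mod p
        value = 0
        for c in reversed(coeffs):         # Horner's rule mod p
            value = (value * x + c) % p
        shares.append((x, value))
    return shares

def recover(shares, p):
    """Interpolate u(0) mod p from n + 1 shares (x_i, u(x_i))."""
    total = 0
    for i, (xi, yi) in enumerate(shares):
        num = den = 1
        for j, (xj, _) in enumerate(shares):
            if j != i:
                num = num * (-xj) % p      # factor (0 - x_j) of the numerator
                den = den * (xi - xj) % p  # factor (x_i - x_j) of the denominator
        total = (total + yi * num * pow(den, -1, p)) % p
    return total
```

For example, with `p = 2**31 - 1` and `shares = make_shares(123456789, 2, 5, p)`, any three of the five shares passed to `recover` give back 123456789.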

It is instructive to note that evaluation of the interpolation polynomial is 
just a special case of the Chinese remainder algorithm of Section 4.3.2 and 
exercise 4.6.2-3, since we know the values of $u_{[n]}(x)$ modulo the relatively prime
polynomials $x - x_0$, ..., $x - x_n$. (As we have seen in Section 4.6.2, $f(x) \bmod
(x - x_0) = f(x_0)$.) Under this interpretation, Newton’s formula (42) is precisely
the “mixed-radix representation” of Eq. 4.3.2-24; and 4.3.2-23 yields another
way to compute $\alpha_0$, ..., $\alpha_n$ using the same number of operations as (44).


By applying fast Fourier transforms, it is possible to reduce the running
time for interpolation to $O(n (\log n)^2)$, and a similar reduction can also be made
for related algorithms such as the solution to the Chinese remainder problem
and the evaluation of an $n$th degree polynomial at $n$ different points. [See E.
Horowitz, Inf. Proc. Letters 1 (1972), 157-163; R. Moenck and A. Borodin, J.
Comp. Syst. Sci. 8 (1974), 336-385; and A. Borodin, Complexity of Sequential
and Parallel Numerical Algorithms, ed. by J. F. Traub (New York: Academic
Press, 1973), 149-180.] However, this must be regarded as a purely theoretical
possibility at present, since the known algorithms have a rather large overhead
factor that makes them unattractive unless $n$ is quite large.

A remarkable modification of the method of divided differences, an extension
that applies to rational functions instead of to polynomials, was introduced by 
T. N. Thiele in 1909. Thiele’s method of “reciprocal differences” is discussed 
in L. M. Milne-Thompson’s Calculus of Finite Differences (London: MacMillan, 
1933), Chapter 5; see also R. W. Floyd, CACM 3 (1960), 508. 

*Bilinear forms. Several of the problems we have considered in this section are
special cases of the general problem of evaluating a set of bilinear forms

$$z_k = \sum_{\substack{1\le i\le m\\ 1\le j\le n}} t_{ijk}\,x_i y_j, \qquad \text{for } 1\le k\le s, \tag{45}$$

where the $t_{ijk}$ are specific coefficients belonging to some given field. The
three-dimensional array $(t_{ijk})$ is called an $m\times n\times s$ tensor, and we can display it by
writing down $s$ matrices of size $m\times n$, one for each value of $k$. For example,
the problem of multiplying complex numbers, namely the problem of evaluating

$$z_1 + iz_2 = (x_1 + ix_2)(y_1 + iy_2) = (x_1y_1 - x_2y_2) + i(x_1y_2 + x_2y_1), \tag{46}$$

is the problem of computing the bilinear form specified by the $2\times2\times2$ tensor

$$\begin{pmatrix}1&0\\0&-1\end{pmatrix}\begin{pmatrix}0&1\\1&0\end{pmatrix}.$$

Matrix multiplication as defined in (34) is the problem of evaluating a set of
bilinear forms corresponding to a particular $mn\times ns\times ms$ tensor. Fourier
transforms (37) can also be cast in this mold, although they are linear instead
of bilinear, if we let the $x$’s be constant rather than variable.

The evaluation of bilinear forms is most easily studied if we restrict
ourselves to what might be called normal evaluation schemes, in which all chain
multiplications take place between a linear combination of the $x$’s and a linear
combination of the $y$’s. Thus, we form $r$ products

$$w_l = (a_{1l}x_1 + \cdots + a_{ml}x_m)(b_{1l}y_1 + \cdots + b_{nl}y_n), \qquad \text{for } 1\le l\le r, \tag{47}$$

and obtain the $z$’s as linear combinations of these products,

$$z_k = c_{k1}w_1 + \cdots + c_{kr}w_r, \qquad \text{for } 1\le k\le s. \tag{48}$$

Here all the $a$’s, $b$’s, and $c$’s belong to a given field of coefficients. By comparing
(48) to (45), we see that a normal evaluation scheme is correct for the tensor
$(t_{ijk})$ if and only if

$$t_{ijk} = a_{i1}b_{j1}c_{k1} + \cdots + a_{ir}b_{jr}c_{kr} \tag{49}$$

for $1\le i\le m$, $1\le j\le n$, and $1\le k\le s$.

A nonzero tensor $(t_{ijk})$ is said to be of rank one if there are three vectors
$(a_1,\ldots,a_m)$, $(b_1,\ldots,b_n)$, $(c_1,\ldots,c_s)$ such that $t_{ijk} = a_ib_jc_k$ for all $i$, $j$, $k$.
We can extend this definition to all tensors by saying that the rank of $(t_{ijk})$ is
the minimum number $r$ such that $(t_{ijk})$ is expressible as the sum of $r$ rank-one
tensors in the given field. Comparing this definition with Eq. (49) shows that the
rank of a tensor is the minimum number of chain multiplications in a normal
evaluation of the corresponding bilinear forms. Incidentally, when $s = 1$ the
tensor $(t_{ijk})$ is just an ordinary matrix, and the rank of $(t_{ij1})$ as a tensor is the
same as its rank as a matrix (see exercise 49). The concept of tensor rank was
introduced by F. L. Hitchcock in J. Math. and Physics 6 (1927), 164-189; its
application to the complexity of polynomial evaluation was pointed out in an
important paper by V. Strassen, J. für die reine und angew. Math. 264 (1973),
184-202.
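Condition (49) is easy to test mechanically. The following sketch is illustrative only: the function names are hypothetical, and the sample realization is the familiar three-multiplication scheme for the complex product (46), which is supplied here as an assumption and is not one of the numbered schemes of this section.

```python
def tensor_from_realization(A, B, C):
    """Return t[i][j][k] = sum over l of A[i][l]*B[j][l]*C[k][l], as in Eq. (49)."""
    m, n, s, r = len(A), len(B), len(C), len(A[0])
    return [[[sum(A[i][l] * B[j][l] * C[k][l] for l in range(r))
              for k in range(s)] for j in range(n)] for i in range(m)]

def evaluate_normal(A, B, C, x, y):
    """Evaluate the bilinear forms by a normal scheme, Eqs. (47) and (48)."""
    r = len(A[0])
    w = [sum(A[i][l] * x[i] for i in range(len(x))) *
         sum(B[j][l] * y[j] for j in range(len(y)))
         for l in range(r)]                       # the r chain multiplications
    return [sum(C[k][l] * w[l] for l in range(r)) for k in range(len(C))]

# Assumed example: w1 = x1*y1, w2 = x2*y2, w3 = (x1 + x2)*(y1 + y2);
# z1 = w1 - w2, z2 = w3 - w1 - w2.
A = [[1, 0, 1], [0, 1, 1]]
B = [[1, 0, 1], [0, 1, 1]]
C = [[1, -1, 0], [-1, -1, 1]]
```

Here `tensor_from_realization(A, B, C)` reproduces the tensor of (46), so the rank of that tensor is at most 3.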

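Winograd’s scheme (35) for matrix multiplication is “abnormal” because it
mixes $x$’s and $y$’s before multiplying them. The Strassen-Winograd scheme (36),
on the other hand, does not rely on the commutativity of multiplication, so it is
normal. In fact, (36) corresponds to the following way to represent the $4\times4\times4$
tensor for $2\times2$ matrix multiplication as a sum of seven rank-one tensors:

$$\begin{aligned}
&\left(\begin{smallmatrix}1&0&0&0\\0&1&0&0\\0&0&0&0\\0&0&0&0\end{smallmatrix}\right)
\left(\begin{smallmatrix}0&0&0&0\\0&0&0&0\\1&0&0&0\\0&1&0&0\end{smallmatrix}\right)
\left(\begin{smallmatrix}0&0&1&0\\0&0&0&1\\0&0&0&0\\0&0&0&0\end{smallmatrix}\right)
\left(\begin{smallmatrix}0&0&0&0\\0&0&0&0\\0&0&1&0\\0&0&0&1\end{smallmatrix}\right)\\
&\quad=
\left(\begin{smallmatrix}1&0&0&0\\0&0&0&0\\0&0&0&0\\0&0&0&0\end{smallmatrix}\right)
\left(\begin{smallmatrix}1&0&0&0\\0&0&0&0\\0&0&0&0\\0&0&0&0\end{smallmatrix}\right)
\left(\begin{smallmatrix}1&0&0&0\\0&0&0&0\\0&0&0&0\\0&0&0&0\end{smallmatrix}\right)
\left(\begin{smallmatrix}1&0&0&0\\0&0&0&0\\0&0&0&0\\0&0&0&0\end{smallmatrix}\right)\\
&\quad+
\left(\begin{smallmatrix}0&0&0&0\\0&1&0&0\\0&0&0&0\\0&0&0&0\end{smallmatrix}\right)
\left(\begin{smallmatrix}0&0&0&0\\0&0&0&0\\0&0&0&0\\0&0&0&0\end{smallmatrix}\right)
\left(\begin{smallmatrix}0&0&0&0\\0&0&0&0\\0&0&0&0\\0&0&0&0\end{smallmatrix}\right)
\left(\begin{smallmatrix}0&0&0&0\\0&0&0&0\\0&0&0&0\\0&0&0&0\end{smallmatrix}\right)
+
\left(\begin{smallmatrix}0&0&0&0\\0&0&0&0\\0&0&0&0\\0&0&0&0\end{smallmatrix}\right)
\left(\begin{smallmatrix}0&0&\bar1&1\\0&0&0&0\\0&0&1&\bar1\\0&0&0&0\end{smallmatrix}\right)
\left(\begin{smallmatrix}0&0&0&0\\0&0&0&0\\0&0&0&0\\0&0&0&0\end{smallmatrix}\right)
\left(\begin{smallmatrix}0&0&\bar1&1\\0&0&0&0\\0&0&1&\bar1\\0&0&0&0\end{smallmatrix}\right)\\
&\quad+
\left(\begin{smallmatrix}0&0&0&0\\0&0&0&0\\0&0&0&0\\0&0&0&0\end{smallmatrix}\right)
\left(\begin{smallmatrix}0&0&0&0\\0&0&0&0\\0&0&0&0\\\bar1&1&1&\bar1\end{smallmatrix}\right)
\left(\begin{smallmatrix}0&0&0&0\\0&0&0&0\\0&0&0&0\\0&0&0&0\end{smallmatrix}\right)
\left(\begin{smallmatrix}0&0&0&0\\0&0&0&0\\0&0&0&0\\0&0&0&0\end{smallmatrix}\right)
+
\left(\begin{smallmatrix}0&0&0&0\\0&0&0&0\\0&0&0&0\\0&0&0&0\end{smallmatrix}\right)
\left(\begin{smallmatrix}0&0&0&0\\0&0&0&0\\0&0&0&0\\0&0&0&0\end{smallmatrix}\right)
\left(\begin{smallmatrix}0&0&0&0\\0&0&0&0\\\bar1&0&1&0\\\bar1&0&1&0\end{smallmatrix}\right)
\left(\begin{smallmatrix}0&0&0&0\\0&0&0&0\\\bar1&0&1&0\\\bar1&0&1&0\end{smallmatrix}\right)\\
&\quad+
\left(\begin{smallmatrix}0&0&0&0\\0&0&0&0\\0&0&0&0\\0&0&0&0\end{smallmatrix}\right)
\left(\begin{smallmatrix}0&0&0&0\\0&0&0&0\\0&0&0&0\\0&0&0&0\end{smallmatrix}\right)
\left(\begin{smallmatrix}0&0&0&1\\0&0&0&1\\0&0&0&\bar1\\0&0&0&\bar1\end{smallmatrix}\right)
\left(\begin{smallmatrix}0&0&0&0\\0&0&0&0\\0&0&0&0\\0&0&0&0\end{smallmatrix}\right)
+
\left(\begin{smallmatrix}0&0&0&0\\0&0&0&0\\0&0&0&0\\0&0&0&0\end{smallmatrix}\right)
\left(\begin{smallmatrix}\bar1&0&1&\bar1\\0&0&0&0\\1&0&\bar1&1\\1&0&\bar1&1\end{smallmatrix}\right)
\left(\begin{smallmatrix}\bar1&0&1&\bar1\\0&0&0&0\\1&0&\bar1&1\\1&0&\bar1&1\end{smallmatrix}\right)
\left(\begin{smallmatrix}\bar1&0&1&\bar1\\0&0&0&0\\1&0&\bar1&1\\1&0&\bar1&1\end{smallmatrix}\right).
\end{aligned}\tag{50}$$

(Here $\bar1$ stands for $-1$.)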

The fact that (49) is symmetric in $i$, $j$, $k$ and invariant under a variety
of transformations makes the study of tensor rank mathematically tractable,
and it also leads to some surprising consequences about bilinear forms. We
can permute the indices $i$, $j$, $k$ to obtain “transposed” bilinear forms, and the
transposed tensor clearly has the same rank; but the corresponding bilinear forms
are conceptually quite different. For example, a normal scheme for evaluating an
$(m\times n)$ times $(n\times s)$ matrix product implies the existence of a normal scheme to
evaluate an $(n\times s)$ times $(s\times m)$ matrix product, using the same number of chain
multiplications. In matrix terms these two problems hardly seem to be related
at all (they involve different numbers of dot products on vectors of different
sizes), but in tensor terms they are equivalent. [Cf. V. Ia. Pan, Uspekhi Mat.
Nauk 27,5 (1972), 249-250; J. E. Hopcroft and J. Musinski, SIAM J. Computing
2 (1973), 159-173.]

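When the tensor $(t_{ijk})$ can be represented as a sum (49) of $r$ rank-one tensors,
let $A$, $B$, $C$ be the matrices $(a_{il})$, $(b_{jl})$, $(c_{kl})$ of respective sizes $m\times r$, $n\times r$, $s\times r$;
we shall say that $A$, $B$, $C$ is a realization of the tensor $(t_{ijk})$. For example, the
realization of $2\times2$ matrix multiplication in (50) can be specified by the matrices

$$A = \begin{pmatrix}1&0&\bar1&0&0&1&\bar1\\0&1&0&0&0&1&0\\0&0&1&0&1&\bar1&1\\0&0&0&1&1&\bar1&1\end{pmatrix},\quad
B = \begin{pmatrix}1&0&0&\bar1&\bar1&0&1\\0&1&0&1&0&0&0\\0&0&1&1&1&0&\bar1\\0&0&\bar1&\bar1&0&1&1\end{pmatrix},\quad
C = \begin{pmatrix}1&1&0&0&0&0&0\\1&0&1&1&0&0&1\\1&0&0&0&1&1&1\\1&0&1&0&1&0&1\end{pmatrix}. \tag{51}$$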
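As a check on the seven-multiplication realization, here is a small sketch that evaluates a $2\times2$ matrix product by (47) and (48). It assumes the index conventions that can be read off from the left-hand tensor of (50): the factors are $\bigl(\begin{smallmatrix}x_1&x_2\\x_3&x_4\end{smallmatrix}\bigr)$ and $\bigl(\begin{smallmatrix}y_1&y_3\\y_2&y_4\end{smallmatrix}\bigr)$, and the product is $\bigl(\begin{smallmatrix}z_1&z_3\\z_2&z_4\end{smallmatrix}\bigr)$. The function name is hypothetical.

```python
# Realization matrices with -1 written out; columns correspond to the seven products.
A = [[1, 0, -1, 0, 0,  1, -1],
     [0, 1,  0, 0, 0,  1,  0],
     [0, 0,  1, 0, 1, -1,  1],
     [0, 0,  0, 1, 1, -1,  1]]
B = [[1, 0,  0, -1, -1, 0,  1],
     [0, 1,  0,  1,  0, 0,  0],
     [0, 0,  1,  1,  1, 0, -1],
     [0, 0, -1, -1,  0, 1,  1]]
C = [[1, 1, 0, 0, 0, 0, 0],
     [1, 0, 1, 1, 0, 0, 1],
     [1, 0, 0, 0, 1, 1, 1],
     [1, 0, 1, 0, 1, 0, 1]]

def multiply_2x2(x, y):
    """x = [x1, x2, x3, x4], y = [y1, y2, y3, y4]; returns [z1, z2, z3, z4]."""
    w = [sum(A[i][l] * x[i] for i in range(4)) *
         sum(B[j][l] * y[j] for j in range(4))
         for l in range(7)]                      # seven chain multiplications
    return [sum(C[k][l] * w[l] for l in range(7)) for k in range(4)]
```

For example, `multiply_2x2([1, 2, 3, 4], [5, 6, 7, 8])` corresponds to the product of the matrices with rows (1, 2), (3, 4) and (5, 7), (6, 8).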

An $m\times n\times s$ tensor $(t_{ijk})$ can also be represented as a matrix by grouping
its subscripts together. We shall write $(t_{(ij)k})$ for the $mn\times s$ matrix whose rows
are indexed by the pair of subscripts $(i,j)$ and whose columns are indexed by $k$.
Similarly, $(t_{k(ij)})$ stands for the $s\times mn$ matrix that contains $t_{ijk}$ in row $k$ and
column $(i,j)$; $(t_{(ik)j})$ is an $ms\times n$ matrix, and so on. The indices of an array
need not be integers, and we are using ordered pairs as indices here. We can use
this notation to derive the following simple but useful lower bound on the rank
of a tensor.

Lemma T. Let $A$, $B$, $C$ be a realization of an $m\times n\times s$ tensor $(t_{ijk})$. Then
$\operatorname{rank}(A) \ge \operatorname{rank}(t_{i(jk)})$, $\operatorname{rank}(B) \ge \operatorname{rank}(t_{j(ik)})$, and $\operatorname{rank}(C) \ge \operatorname{rank}(t_{k(ij)})$;
consequently

$$\operatorname{rank}(t_{ijk}) \ge \max\bigl(\operatorname{rank}(t_{i(jk)}),\ \operatorname{rank}(t_{j(ik)}),\ \operatorname{rank}(t_{k(ij)})\bigr).$$

Proof. It suffices by symmetry to show that $r \ge \operatorname{rank}(A) \ge \operatorname{rank}(t_{i(jk)})$. Since
$A$ is an $m\times r$ matrix, it is obvious that $A$ cannot have rank greater than $r$.
Furthermore, according to (49), the matrix $(t_{i(jk)})$ is equal to $AQ$, where $Q$ is
the $r\times ns$ matrix defined by $Q_{l(j,k)} = b_{jl}c_{kl}$. If $x$ is any row vector such that
$xA = 0$ then $xAQ = 0$, hence all linear dependencies in $A$ occur also in $AQ$. It
follows that $\operatorname{rank}(AQ) \le \operatorname{rank}(A)$. ∎
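The bound of Lemma T can be computed directly from a tensor. The sketch below is one possible way to do so, using exact rational Gaussian elimination; the helper names `matrix_rank` and `flattening_ranks` are hypothetical, and the tensor is assumed to be stored as a nested list `t[i][j][k]`.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank of a rational matrix by Gaussian elimination with exact arithmetic."""
    M = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(M[0]) if M else 0):
        pivot = next((r for r in range(rank, len(M)) if M[r][col] != 0), None)
        if pivot is None:
            continue
        M[rank], M[pivot] = M[pivot], M[rank]
        for r in range(rank + 1, len(M)):
            factor = M[r][col] / M[rank][col]
            M[r] = [a - factor * b for a, b in zip(M[r], M[rank])]
        rank += 1
    return rank

def flattening_ranks(t):
    """Ranks of the three matrices (t_{i(jk)}), (t_{j(ik)}), (t_{k(ij)})."""
    m, n, s = len(t), len(t[0]), len(t[0][0])
    by_i = [[t[i][j][k] for j in range(n) for k in range(s)] for i in range(m)]
    by_j = [[t[i][j][k] for i in range(m) for k in range(s)] for j in range(n)]
    by_k = [[t[i][j][k] for i in range(m) for j in range(n)] for k in range(s)]
    return matrix_rank(by_i), matrix_rank(by_j), matrix_rank(by_k)
```

The maximum of the three returned values is a lower bound on the tensor rank; for the complex-multiplication tensor of (46) it gives only 2, so the bound need not be attained.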

As an example of the use of Lemma T, let us consider the problem of
polynomial multiplication. Suppose we want to multiply a general polynomial
of degree 2 by a general polynomial of degree 3, obtaining the coefficients of the
product:

$$(x_0 + x_1u + x_2u^2)(y_0 + y_1u + y_2u^2 + y_3u^3)
= z_0 + z_1u + z_2u^2 + z_3u^3 + z_4u^4 + z_5u^5. \tag{52}$$

This is the problem of evaluating six bilinear forms corresponding to the $3\times4\times6$
tensor

$$\begin{pmatrix}1&0&0&0\\0&0&0&0\\0&0&0&0\end{pmatrix}
\begin{pmatrix}0&1&0&0\\1&0&0&0\\0&0&0&0\end{pmatrix}
\begin{pmatrix}0&0&1&0\\0&1&0&0\\1&0&0&0\end{pmatrix}
\begin{pmatrix}0&0&0&1\\0&0&1&0\\0&1&0&0\end{pmatrix}
\begin{pmatrix}0&0&0&0\\0&0&0&1\\0&0&1&0\end{pmatrix}
\begin{pmatrix}0&0&0&0\\0&0&0&0\\0&0&0&1\end{pmatrix}. \tag{53}$$

For brevity, we may write (52) as $x(u)y(u) = z(u)$, letting $x(u)$ denote the
polynomial $x_0 + x_1u + x_2u^2$, etc. Note that we have come full circle from the
way we began this section, since Eq. (1) refers to $u(x)$, not $x(u)$; the notation
has changed because the coefficients of the polynomials are now the variables of
interest to us.

If each of the six matrices in (53) is regarded as a vector of length 12 indexed
by $(i,j)$, it is clear that the vectors are linearly independent, since they are
nonzero in different positions; hence the rank of (53) is at least 6 by Lemma T.
Conversely, it is possible to obtain the coefficients $z_0$, $z_1$, ..., $z_5$ by making only
six chain multiplications, for example by computing

$$x(0)y(0),\quad x(1)y(1),\quad \ldots,\quad x(5)y(5); \tag{54}$$

this gives the values of $z(0)$, $z(1)$, ..., $z(5)$, and the formulas developed above for
interpolation will yield the coefficients of $z(u)$. The evaluation of $x(j)$ and $y(j)$ can
be carried out entirely in terms of additions and/or parameter multiplications,
and the interpolation formula merely takes linear combinations of these values.
Thus, all of the chain multiplications are shown in (54), and the rank of (53)
is 6. (We used essentially this same technique when multiplying high-precision
numbers in Algorithm 4.3.3C.)
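The evaluate-and-interpolate idea of (54) might be sketched as follows. This is an illustration rather than a recommended method: the function names are hypothetical, Lagrange's form of the interpolation polynomial is used for brevity instead of Newton's, and exact rationals from Python's `fractions` module stand in for the field of coefficients.

```python
from fractions import Fraction

def poly_eval(coeffs, u):
    """Horner's rule; coefficients are listed from the constant term upward."""
    result = 0
    for c in reversed(coeffs):
        result = result * u + c
    return result

def multiply_by_interpolation(x, y):
    """Coefficients of x(u)*y(u) from the products x(j)*y(j), j = 0, 1, 2, ..."""
    points = range(len(x) + len(y) - 1)
    values = [poly_eval(x, j) * poly_eval(y, j) for j in points]  # chain multiplications
    z = [Fraction(0)] * len(points)
    for j in points:
        basis, denom = [Fraction(1)], 1          # builds prod (u - i) over i != j
        for i in points:
            if i != j:
                shifted = [Fraction(0)] + basis
                scaled = [i * c for c in basis] + [Fraction(0)]
                basis = [s - t for s, t in zip(shifted, scaled)]
                denom *= j - i
        for k, b in enumerate(basis):
            z[k] += values[j] * b / denom        # linear combination only
    return z
```

For a quadratic times a cubic the list `values` holds exactly the six products of (54); everything else is addition or multiplication by constants.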

The realization $A$, $B$, $C$ of (53) sketched in the above paragraph turns out
to be

$$\begin{gathered}
A = \begin{pmatrix}1&1&1&1&1&1\\0&1&2&3&4&5\\0&1&4&9&16&25\end{pmatrix},\qquad
B = \begin{pmatrix}1&1&1&1&1&1\\0&1&2&3&4&5\\0&1&4&9&16&25\\0&1&8&27&64&125\end{pmatrix},\\[4pt]
C = \frac{1}{120}\begin{pmatrix}
120&0&0&0&0&0\\
-274&600&-600&400&-150&24\\
225&-770&1070&-780&305&-50\\
-85&355&-590&490&-205&35\\
15&-70&130&-120&55&-10\\
-1&5&-10&10&-5&1\end{pmatrix}.
\end{gathered}\tag{55}$$

Thus, the scheme does indeed require the minimum number of chain multiplications,
but it is completely impractical because it involves so many additions
and parameter multiplications. We shall now study a practical approach to the
generation of more efficient schemes, suggested by S. Winograd.

In the first place, to evaluate the coefficients of $x(u)y(u)$ when $\deg(x) = m$
and $\deg(y) = n$, one can use the identity

$$x(u)y(u) = \bigl(x(u)y(u) \bmod p(u)\bigr) + x_my_n\,p(u), \tag{56}$$

when $p(u)$ is any monic polynomial of degree $m + n$. The polynomial $p(u)$ should
be chosen so that the coefficients of $x(u)y(u) \bmod p(u)$ are easy to evaluate.

In the second place, to evaluate the coefficients of $x(u)y(u) \bmod p(u)$, when
the polynomial $p(u)$ can be factored into $q(u)r(u)$ where $\gcd(q(u), r(u)) = 1$, one
can use the identity

$$x(u)y(u) \bmod q(u)r(u) = \Bigl(a(u)r(u)\bigl(x(u)y(u) \bmod q(u)\bigr)
+ b(u)q(u)\bigl(x(u)y(u) \bmod r(u)\bigr)\Bigr) \bmod q(u)r(u) \tag{57}$$

where $a(u)r(u) + b(u)q(u) = 1$; this is essentially the Chinese remainder theorem
applied to polynomials.

In the third place, to evaluate the coefficients of $x(u)y(u) \bmod p(u)$ when
$p(u)$ has only one irreducible factor over the field of coefficients, one can use the
identity

$$x(u)y(u) \bmod p(u) = \bigl(x(u) \bmod p(u)\bigr)\bigl(y(u) \bmod p(u)\bigr) \bmod p(u). \tag{58}$$

Repeated application of (56), (57), and (58) tends to produce efficient schemes,
as we shall see.

For our example problem (52), let us choose $p(u) = u^5 - u$ and apply (56);
the reason for this choice of $p(u)$ will appear as we proceed. Writing $p(u) =
u(u^4 - 1)$, rule (57) reduces to

$$x(u)y(u) \bmod u(u^4 - 1) = \Bigl(-(u^4 - 1)x_0y_0
+ u^4\bigl(x(u)y(u) \bmod (u^4 - 1)\bigr)\Bigr) \bmod (u^5 - u). \tag{59}$$

Here we have used the fact that $x(u)y(u) \bmod u = x_0y_0$; in general it is a good
idea to choose $p(u)$ in such a way that $p(0) = 0$, so that this simplification
can be used. If we could now determine the coefficients $w_0$, $w_1$, $w_2$, $w_3$ of the
polynomial $x(u)y(u) \bmod (u^4 - 1) = w_0 + w_1u + w_2u^2 + w_3u^3$, our problem
would be solved, since

$$u^4\bigl(x(u)y(u) \bmod (u^4 - 1)\bigr) \bmod (u^5 - u) = w_0u^4 + w_1u + w_2u^2 + w_3u^3,$$

and the combination of (56) and (59) would reduce to

$$x(u)y(u) = x_0y_0 + (w_1 - x_2y_3)u + w_2u^2 + w_3u^3 + (w_0 - x_0y_0)u^4 + x_2y_3u^5. \tag{60}$$

(This formula can, of course, be verified directly.)
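A direct verification of (60) is easy to carry out numerically. In the sketch below the helper names are hypothetical, and the cyclic part $x(u)y(u) \bmod (u^4 - 1)$ is computed naively from its definition, since the point is only to check the recombination.

```python
def cyclic4(x, y):
    """Coefficients w_0..w_3 of x(u)*y(u) mod (u^4 - 1), computed naively."""
    w = [0, 0, 0, 0]
    for i in range(4):
        for j in range(4):
            w[(i + j) % 4] += x[i] * y[j]
    return w

def product_via_60(x, y):
    """z_0..z_5 for (x0 + x1 u + x2 u^2)(y0 + y1 u + y2 u^2 + y3 u^3), by Eq. (60)."""
    x0, x1, x2 = x
    y0, y1, y2, y3 = y
    w = cyclic4([x0, x1, x2, 0], y)              # x(u) padded with x3 = 0
    return [x0 * y0, w[1] - x2 * y3, w[2], w[3], w[0] - x0 * y0, x2 * y3]
```

Only the two extra products $x_0y_0$ and $x_2y_3$ are needed beyond the cyclic part.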

The problem remaining to be solved is to compute $x(u)y(u) \bmod (u^4 - 1)$;
and this subproblem is interesting in itself. Let us momentarily allow $x(u)$ to be
of degree 3 instead of degree 2. Then the coefficients of $x(u)y(u) \bmod (u^4 - 1)$
are respectively

$$\begin{gathered}
x_0y_0 + x_1y_3 + x_2y_2 + x_3y_1, \qquad x_0y_1 + x_1y_0 + x_2y_3 + x_3y_2,\\
x_0y_2 + x_1y_1 + x_2y_0 + x_3y_3, \qquad x_0y_3 + x_1y_2 + x_2y_1 + x_3y_0,
\end{gathered}$$

and the corresponding tensor is

$$\begin{pmatrix}1&0&0&0\\0&0&0&1\\0&0&1&0\\0&1&0&0\end{pmatrix}
\begin{pmatrix}0&1&0&0\\1&0&0&0\\0&0&0&1\\0&0&1&0\end{pmatrix}
\begin{pmatrix}0&0&1&0\\0&1&0&0\\1&0&0&0\\0&0&0&1\end{pmatrix}
\begin{pmatrix}0&0&0&1\\0&0&1&0\\0&1&0&0\\1&0&0&0\end{pmatrix}. \tag{61}$$

In general when $\deg(x) = \deg(y) = n - 1$, the coefficients of $x(u)y(u) \bmod (u^n - 1)$
are called the cyclic convolution of $(x_0, x_1, \ldots, x_{n-1})$ and $(y_0, y_1, \ldots, y_{n-1})$. The
$k$th coefficient $w_k$ is the bilinear form $\sum x_iy_j$ summed over all $i$ and $j$ with
$i + j \equiv k$ (modulo $n$).

The cyclic convolution of degree 4 can be obtained by applying rule (57).
The first step is to find the factors of $u^4 - 1$, namely $(u - 1)(u + 1)(u^2 + 1)$. We
could write this as $(u^2 - 1)(u^2 + 1)$, then apply rule (57), then use (57) again
on the part modulo $(u^2 - 1) = (u - 1)(u + 1)$; but it is easier to generalize
the Chinese remainder rule (57) directly to the case of several relatively prime
factors. For example, we have

$$\begin{aligned}
x(u)y(u) \bmod q_1(u)q_2(u)q_3(u)
= \Bigl(&a_1(u)q_2(u)q_3(u)\bigl(x(u)y(u) \bmod q_1(u)\bigr) + a_2(u)q_1(u)q_3(u)\bigl(x(u)y(u) \bmod q_2(u)\bigr)\\
&+ a_3(u)q_1(u)q_2(u)\bigl(x(u)y(u) \bmod q_3(u)\bigr)\Bigr) \bmod q_1(u)q_2(u)q_3(u),
\end{aligned}\tag{62}$$

where $a_1(u)q_2(u)q_3(u) + a_2(u)q_1(u)q_3(u) + a_3(u)q_1(u)q_2(u) = 1$. (The latter
equation can be understood in another way, by noting that the partial fraction
expansion of $1/q_1(u)q_2(u)q_3(u)$ is $a_1(u)/q_1(u) + a_2(u)/q_2(u) + a_3(u)/q_3(u)$. When
each of the $q$’s is a linear polynomial $u - \alpha_i$, the generalized Chinese remainder
rule reduces to ordinary interpolation as in Eq. (41), since $f(u) \bmod (u - \alpha_i) =
f(\alpha_i)$.) From (62) we obtain

$$\begin{aligned}
x(u)y(u) \bmod (u^4 - 1) = \Bigl(&\tfrac14(u^3 + u^2 + u + 1)\,x(1)y(1) - \tfrac14(u^3 - u^2 + u - 1)\,x(-1)y(-1)\\
&- \tfrac12(u^2 - 1)\bigl(x(u)y(u) \bmod (u^2 + 1)\bigr)\Bigr) \bmod (u^4 - 1).
\end{aligned}\tag{63}$$

The remaining problem is to evaluate $x(u)y(u) \bmod (u^2 + 1)$, and it is time to
invoke rule (58). First we reduce $x(u)$ and $y(u)$ mod $(u^2 + 1)$, obtaining $X(u) =
(x_0 - x_2) + (x_1 - x_3)u$, $Y(u) = (y_0 - y_2) + (y_1 - y_3)u$. Then (58) tells us
to evaluate $X(u)Y(u) = Z_0 + Z_1u + Z_2u^2$, and to reduce this in turn modulo
$(u^2 + 1)$, obtaining $(Z_0 - Z_2) + Z_1u$. The job of computing $X(u)Y(u)$ is simple;
we can use rule (56) with $p(u) = u(u + 1)$ and we get

$$Z_0 = X_0Y_0, \qquad Z_1 = X_0Y_0 - (X_0 - X_1)(Y_0 - Y_1) + X_1Y_1, \qquad Z_2 = X_1Y_1.$$

(We have thereby rediscovered the trick of Eq. 4.3.3-2 in a more systematic way.)
Putting everything together yields the following realization $A$, $B$, $C$ of degree-4
cyclic convolution:

$$\begin{pmatrix}1&1&1&0&1\\1&\bar1&0&1&\bar1\\1&1&\bar1&0&\bar1\\1&\bar1&0&\bar1&1\end{pmatrix},\qquad
\begin{pmatrix}1&1&1&0&1\\1&\bar1&0&1&\bar1\\1&1&\bar1&0&\bar1\\1&\bar1&0&\bar1&1\end{pmatrix},\qquad
\begin{pmatrix}1&1&2&\bar2&0\\1&\bar1&2&2&\bar2\\1&1&\bar2&2&0\\1&\bar1&\bar2&\bar2&2\end{pmatrix}\times\frac14. \tag{64}$$

Here $\bar1$ stands for $-1$ and $\bar2$ for $-2$.

The tensor for cyclic convolution of degree $n$ satisfies

$$t_{i,j,k} = t_{k,-j,i}, \tag{65}$$

treating the subscripts modulo $n$, since $t_{ijk} = 1$ if and only if $i + j \equiv k$
(modulo $n$). Thus if $(a_{il})$, $(b_{jl})$, $(c_{kl})$ is a realization of the cyclic convolution, so
is $(c_{kl})$, $(b_{-j,l})$, $(a_{il})$; in particular, we can realize (61) by transforming (64) into

$$\begin{pmatrix}1&1&2&\bar2&0\\1&\bar1&2&2&\bar2\\1&1&\bar2&2&0\\1&\bar1&\bar2&\bar2&2\end{pmatrix}\times\frac14,\qquad
\begin{pmatrix}1&1&1&0&1\\1&\bar1&0&\bar1&1\\1&1&\bar1&0&\bar1\\1&\bar1&0&1&\bar1\end{pmatrix},\qquad
\begin{pmatrix}1&1&1&0&1\\1&\bar1&0&1&\bar1\\1&1&\bar1&0&\bar1\\1&\bar1&0&\bar1&1\end{pmatrix}. \tag{66}$$

Now all of the complicated scalars appear in the $A$ matrix. This is important
in practice, since we often want to compute the convolution for many values of
$y_0$, $y_1$, $y_2$, $y_3$ but for a fixed choice of $x_0$, $x_1$, $x_2$, $x_3$. In such a situation, the
arithmetic on $x$’s can be done once and for all, and we need not count it. Thus
(66) leads to the following scheme for evaluating the cyclic convolution $w_0$, $w_1$,
$w_2$, $w_3$ when $x_0$, $x_1$, $x_2$, $x_3$ are known in advance:

$$\begin{gathered}
s_1 = y_0 + y_2,\quad s_2 = y_1 + y_3,\quad s_3 = s_1 + s_2,\quad s_4 = s_1 - s_2,\\
s_5 = y_0 - y_2,\quad s_6 = y_3 - y_1,\quad s_7 = s_5 - s_6;\\
m_1 = \frac{x_0 + x_1 + x_2 + x_3}{4}\cdot s_3,\quad
m_2 = \frac{x_0 - x_1 + x_2 - x_3}{4}\cdot s_4,\quad
m_3 = \frac{x_0 + x_1 - x_2 - x_3}{2}\cdot s_5,\\
m_4 = \frac{-x_0 + x_1 + x_2 - x_3}{2}\cdot s_6,\quad
m_5 = \frac{x_3 - x_1}{2}\cdot s_7;\\
t_1 = m_1 + m_2,\quad t_2 = m_3 + m_5,\quad t_3 = m_1 - m_2,\quad t_4 = m_4 - m_5;\\
w_0 = t_1 + t_2,\quad w_1 = t_3 + t_4,\quad w_2 = t_1 - t_2,\quad w_3 = t_3 - t_4.
\end{gathered}\tag{67}$$

There are 5 multiplications and 15 additions, while the definition of cyclic
convolution involves 16 multiplications and 12 additions. We will prove later that 5
multiplications are necessary.
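Scheme (67) translates almost line for line into code. The following sketch assumes exact rational arithmetic via Python's `fractions` module; the names `precompute_x` and `cyclic_convolution_67` are hypothetical, and a naive reference routine is included only for comparison.

```python
from fractions import Fraction

def precompute_x(x):
    """The five constants that depend only on x0..x3 (computed once, not counted)."""
    x0, x1, x2, x3 = (Fraction(v) for v in x)
    return ((x0 + x1 + x2 + x3) / 4, (x0 - x1 + x2 - x3) / 4,
            (x0 + x1 - x2 - x3) / 2, (-x0 + x1 + x2 - x3) / 2, (x3 - x1) / 2)

def cyclic_convolution_67(consts, y):
    """Degree-4 cyclic convolution with 5 multiplications and 15 additions."""
    c1, c2, c3, c4, c5 = consts
    y0, y1, y2, y3 = y
    s1 = y0 + y2; s2 = y1 + y3; s3 = s1 + s2; s4 = s1 - s2
    s5 = y0 - y2; s6 = y3 - y1; s7 = s5 - s6
    m1 = c1 * s3; m2 = c2 * s4; m3 = c3 * s5; m4 = c4 * s6; m5 = c5 * s7
    t1 = m1 + m2; t2 = m3 + m5; t3 = m1 - m2; t4 = m4 - m5
    return [t1 + t2, t3 + t4, t1 - t2, t3 - t4]

def cyclic_convolution_naive(x, y):
    """Reference: w_k = sum of x_i*y_j over i + j = k (modulo 4)."""
    w = [0, 0, 0, 0]
    for i in range(4):
        for j in range(4):
            w[(i + j) % 4] += x[i] * y[j]
    return w
```

In use, `precompute_x` would be called once for the fixed $x$’s, and `cyclic_convolution_67` once for each new vector of $y$’s.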

Going back to our original multiplication problem (52), using (60), we have
derived the realization

$$\begin{gathered}
\begin{pmatrix}4&0&1&1&2&\bar2&0\\0&0&1&\bar1&2&2&\bar2\\0&4&1&1&\bar2&2&0\end{pmatrix}\times\frac14,\qquad
\begin{pmatrix}1&0&1&1&1&0&1\\0&0&1&\bar1&0&\bar1&1\\0&0&1&1&\bar1&0&\bar1\\0&1&1&\bar1&0&1&\bar1\end{pmatrix},\\[4pt]
\begin{pmatrix}1&0&0&0&0&0&0\\0&\bar1&1&\bar1&0&1&\bar1\\0&0&1&1&\bar1&0&\bar1\\0&0&1&\bar1&0&\bar1&1\\\bar1&0&1&1&1&0&1\\0&1&0&0&0&0&0\end{pmatrix}.
\end{gathered}\tag{68}$$

This scheme uses one more than the minimum number of chain multiplications,
but it requires far fewer parameter multiplications than (55). Of course, it must
be admitted that the scheme is still rather complicated: If our goal is simply
to compute the coefficients $z_0$, $z_1$, ..., $z_5$ of the product of two given
polynomials $(x_0 + x_1u + x_2u^2)(y_0 + y_1u + y_2u^2 + y_3u^3)$, as a one-shot problem, our
best bet is still to use the obvious method that does 12 multiplications and 6
additions, unless (say) the $x$’s and $y$’s are matrices. Note that if the $x$’s are fixed
as the $y$’s vary, the new scheme does the evaluation with 7 multiplications and 17
additions. Even though (68) isn’t especially useful as it stands, our derivation has
illustrated important techniques that are useful in a variety of other situations.
For example, Winograd has used this approach to compute Fourier transforms
using significantly fewer multiplications than the “fast Fourier transform”
algorithm needs (see exercise 53).

Let us conclude this section by determining the exact rank of the $n\times n\times n$
tensor that corresponds to the multiplication of two polynomials modulo a third,

$$z_0 + z_1u + \cdots + z_{n-1}u^{n-1}
= (x_0 + x_1u + \cdots + x_{n-1}u^{n-1})(y_0 + y_1u + \cdots + y_{n-1}u^{n-1}) \bmod p(u). \tag{69}$$

Here $p(u)$ stands for any given monic polynomial of degree $n$; in particular, $p(u)$
might be $u^n - 1$, so one of the results of our investigation will be to deduce the
rank of the tensor corresponding to cyclic convolution of degree $n$. It will be
convenient to write $p(u)$ in the form

$$p(u) = u^n - p_{n-1}u^{n-1} - \cdots - p_1u - p_0, \tag{70}$$

so that $u^n \equiv p_0 + p_1u + \cdots + p_{n-1}u^{n-1}$ (modulo $p(u)$).

The tensor element $t_{ijk}$ is the coefficient of $u^k$ in $u^{i+j} \bmod p(u)$; and this is
the element in row $i$, column $k$ of the matrix $P^j$, where

$$P = \begin{pmatrix}
0&1&0&\ldots&0\\
0&0&1&\ldots&0\\
\vdots&\vdots&\vdots&&\vdots\\
0&0&0&\ldots&1\\
p_0&p_1&p_2&\ldots&p_{n-1}\end{pmatrix} \tag{71}$$

is called the “companion matrix” of $p(u)$. (The indices $i$, $j$, $k$ in our discussion
will run from 0 to $n - 1$ instead of from 1 to $n$.) It is convenient to transpose
the tensor, for if $T_{ijk} = t_{ikj}$ the individual layers of $(T_{ijk})$ for $k = 0$, 1, 2, ...,
$n - 1$ are simply given by the matrices

$$I \quad P \quad P^2 \quad \ldots \quad P^{n-1}. \tag{72}$$
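The role of the companion matrix can be made concrete with a short sketch. It is only an illustration: the function names are hypothetical, and $p(u)$ is assumed to be supplied as the list $[p_0, \ldots, p_{n-1}]$ of (70).

```python
def companion(p):
    """Companion matrix (71) of p(u) = u^n - p[n-1] u^(n-1) - ... - p[0]."""
    n = len(p)
    P = [[1 if c == r + 1 else 0 for c in range(n)] for r in range(n - 1)]
    P.append(list(p))                            # last row: u^n mod p(u)
    return P

def mat_mul(X, Y):
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def power_layers(p):
    """The layers I, P, P^2, ..., P^(n-1) of (72)."""
    n = len(p)
    layers = [[[1 if i == j else 0 for j in range(n)] for i in range(n)]]
    for _ in range(n - 1):
        layers.append(mat_mul(layers[-1], companion(p)))
    return layers

def multiply_mod(x, y, p):
    """Coefficients of x(u)*y(u) mod p(u), using t_ijk = (P^j)[i][k]."""
    n = len(p)
    layers = power_layers(p)
    return [sum(x[i] * y[j] * layers[j][i][k]
                for i in range(n) for j in range(n)) for k in range(n)]
```

With `p = [1, 0, 0, 0]`, that is $p(u) = u^4 - 1$, the matrix $P$ is a cyclic shift and `multiply_mod` reduces to the cyclic convolution of degree 4.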

The first rows of the matrices in (72) are respectively the unit vectors
$(1,0,0,\ldots,0)$, $(0,1,0,\ldots,0)$, $(0,0,1,\ldots,0)$, ..., $(0,0,0,\ldots,1)$, hence a linear
combination such as $\sum_{0\le k<n} v_kP^k$ will be the zero matrix if and only if the
$v_k$ are all zero. Furthermore, most of these linear combinations are actually
nonsingular matrices, for we have

$$(w_0, w_1, \ldots, w_{n-1}) \sum_{0\le k<n} v_kP^k = (0, 0, \ldots, 0)
\quad\text{if and only if}\quad v(u)w(u) \equiv 0 \pmod{p(u)},$$

where $v(u) = v_0 + v_1u + \cdots + v_{n-1}u^{n-1}$ and $w(u) = w_0 + w_1u + \cdots +
w_{n-1}u^{n-1}$. Thus, $\sum_{0\le k<n} v_kP^k$ is a singular matrix if and only if the
polynomial $v(u)$ is a multiple of some factor of $p(u)$. We are now ready to prove the
desired result.




Theorem W (S. Winograd, 1975). Let $p(u)$ be a monic polynomial of degree $n$
whose complete factorization over a given infinite field is

$$p(u) = p_1(u)^{e_1} \ldots p_q(u)^{e_q}. \tag{73}$$

Then the rank of the tensor (72) corresponding to the bilinear forms (69) is $2n - q$
over this field.


Proof. The bilinear forms can be evaluated with only $2n - q$ chain multiplications
by using rules (56), (57), (58) in an appropriate fashion, so we must prove only
that the rank $r$ is $\ge 2n - q$. The above discussion establishes the fact that
$\operatorname{rank}(T_{(ij)k}) = n$; hence by Lemma T, any $n\times r$ realization $A$, $B$, $C$ of $(T_{ijk})$
has $\operatorname{rank}(C) = n$. Our strategy will be to use Lemma T again, by finding a
vector $(v_0, v_1, \ldots, v_{n-1})$ that has the following two properties:

a) The vector $(v_0, v_1, \ldots, v_{n-1})C$ has at most $q + r - n$ nonzero coefficients.

b) The matrix $v(P) = \sum_{0\le k<n} v_kP^k$ is nonsingular.

This and Lemma T will prove that $q + r - n \ge n$, since the identity

$$\sum_{1\le l\le r} a_{il}b_{jl}\Bigl(\sum_{0\le k<n} v_kc_{kl}\Bigr) = v(P)_{ij}$$

shows how to realize the $n\times n\times 1$ tensor $v(P)$ of rank $n$ with $q + r - n$ chain
multiplications.

We may assume for convenience that the first $n$ columns of $C$ are linearly
independent. Let $D$ be the $n\times n$ matrix such that the first $n$ columns of
$DC$ are equal to the identity matrix. Our goal will be achieved if there is a
linear combination $(v_0, v_1, \ldots, v_{n-1})$ of at most $q$ rows of $D$, such that $v(P)$ is
nonsingular; such a vector will satisfy conditions (a) and (b).

Since the rows of $D$ are linearly independent, no irreducible factor $p_\lambda(u)$
divides the polynomials corresponding to every row. Given a vector $w =
(w_0, w_1, \ldots, w_{n-1})$, let “covered($w$)” be the set of all $\lambda$ such that $w(u)$ is not
a multiple of $p_\lambda(u)$. From two vectors $v$ and $w$ we can find a linear combination
$v + \alpha w$ such that

$$\operatorname{covered}(v + \alpha w) = \operatorname{covered}(v) \cup \operatorname{covered}(w), \tag{74}$$

for some $\alpha$ in the field. The reason is that if $\lambda$ is covered by $v$ or $w$ but not both,
then $\lambda$ is covered by $v + \alpha w$ for all nonzero $\alpha$; if $\lambda$ is covered by both $v$ and $w$
but $\lambda$ is not covered by $v + \alpha w$, then $\lambda$ is covered by $v + \beta w$ for all $\beta \ne \alpha$.
By trying $q + 1$ different values of $\alpha$, at least one must yield (74). In this way
we can systematically construct a linear combination of at most $q$ rows of $D$,
covering all $\lambda$ for $1 \le \lambda \le q$. ∎

One of the most important corollaries of Theorem W is that the rank of a
tensor can depend on the field from which we draw the elements of the realization
$A$, $B$, $C$. For example, consider the tensor corresponding to cyclic convolution of
degree 5; this is equivalent to multiplication of polynomials mod $p(u) = u^5 - 1$.
Over the field of rational numbers, the complete factorization of $p(u)$ is
$(u - 1)(u^4 + u^3 + u^2 + u + 1)$ by exercise 4.6.2-32, so the tensor rank is $10 - 2 = 8$.
On the other hand, the complete factorization over the real numbers, in terms
of the number $\phi = \frac12(1 + \sqrt5)$, is $(u - 1)(u^2 + \phi u + 1)(u^2 - \phi^{-1}u + 1)$; thus,
the rank is only 7, if we allow arbitrary real numbers to appear in $A$, $B$, $C$.
Over the complex numbers the rank is 5. This phenomenon does not occur in
two-dimensional tensors (i.e., matrices), where the rank can be determined by
evaluating determinants of submatrices and testing for 0. The rank of a matrix
does not change when the field containing its elements is embedded in a larger
field, but the rank of a tensor can decrease when the field gets larger.

In the paper that introduced Theorem W [Math. Systems Theory 10 (1977),
169-180], Winograd went on to show that all realizations of (69) in $2n - q$
chain multiplications correspond to the use of (57), when $q$ is greater than 1.
Furthermore he has shown that the only way to evaluate the coefficients of
$x(u)y(u)$ in $\deg(x) + \deg(y) + 1$ chain multiplications is to use interpolation or
to use (56) with a polynomial that splits into distinct linear factors in the field.
Finally he has proved that the only way to evaluate $x(u)y(u) \bmod p(u)$ in $2n - 1$
chain multiplications when $q = 1$ is essentially to use (58). These results hold
for all polynomial chains, not only “normal” ones. He has extended the results
to multivariate polynomials in SIAM J. Computing 9 (1980), 225-229.

The tensor rank of an arbitrary $m\times n\times 2$ tensor in a suitably large field
has been determined by Joseph Ja’Ja’, SIAM J. Computing 8 (1979), 443-462.

For further reading. In this section we have barely scratched the surface of a 
very large subject in which many beautiful theories are emerging; a considerably 
more comprehensive treatment appears in the book Computational Complexity 
of Algebraic and Numeric Problems by A. Borodin and I. Munro (New York: 
American Elsevier, 1975). 


EXERCISES 

1. [15] What is a good way to evaluate an “odd” polynomial

$$u(x) = u_{2n+1}x^{2n+1} + u_{2n-1}x^{2n-1} + \cdots + u_1x?$$

► 2. [M20] Instead of computing $u(x + x_0)$ by steps H1 and H2 as in the text, discuss
the application of Horner’s rule (2) when polynomial multiplication and addition are
used instead of arithmetic in the domain of coefficients.

3. [20] Give a method analogous to Horner’s rule, for evaluating a polynomial in
two variables $\sum_{i+j\le n} u_{ij}x^iy^j$. (This polynomial has $(n+1)(n+2)/2$ coefficients, and
“total degree” $n$.) Count the number of additions and multiplications you use.

4. [M20] The text shows that scheme (3) is superior to Horner’s rule when we are
evaluating a polynomial with real coefficients at a complex point $z$. Compare (3) to
Horner’s rule when both the coefficients and the variable $z$ are complex numbers; how
many (real) multiplications and addition-subtractions are required by each method?

5. [M15] Count the number of multiplications and additions required by the second-
order rule (4).

6. [22] (L. de Jong and J. van Leeuwen.) Show how to improve on steps S1, ..., S4
of the Shaw-Traub algorithm by computing only about powers of $x_0$.

7. [M24] How can $\beta_0$, ..., $\beta_n$ be calculated so that (6) has the value $u(x_0 + kh)$ for
all integers $k$?

8. [M20] The factorial power $x^{\underline{k}}$ is defined to be $k!\binom{x}{k} = x(x - 1)\ldots(x - k + 1)$.
Explain how to evaluate $u_nx^{\underline{n}} + \cdots + u_1x^{\underline{1}} + u_0$ with at most $n$ multiplications and
$2n - 1$ additions, starting with $x$ and the $n + 3$ constants $u_n$, ..., $u_0$, 1, $n - 1$.

9. [M24] (H. J. Ryser.) Show that if $X = (x_{ij})$ is an $n\times n$ matrix, then

$$\operatorname{per}(X) = \sum (-1)^{n-\epsilon_1-\cdots-\epsilon_n} \prod_{1\le i\le n}\ \sum_{1\le j\le n} \epsilon_jx_{ij}$$

summed over all $2^n$ choices of $\epsilon_1$, ..., $\epsilon_n$ equal to 0 or 1 independently. Count the
number of addition and multiplication operations required to evaluate per($X$) by this
formula.

10. [M21] The permanent of an $n\times n$ matrix $X = (x_{ij})$ may be calculated as follows:
Start with the $n$ quantities $x_{11}$, $x_{12}$, ..., $x_{1n}$. For $1 \le k < n$, assume that the $\binom{n}{k}$
quantities $A_{kS}$ have been computed, for all $k$-element subsets $S$ of $\{1, 2, \ldots, n\}$, where
$A_{kS} = \sum x_{1j_1}\ldots x_{kj_k}$ summed over all $k!$ permutations $j_1\ldots j_k$ of the elements of $S$;
then form all of the sums

$$A_{(k+1)S} = \sum_{j\in S} A_{k(S\setminus\{j\})}\,x_{(k+1)j}.$$

We have per($X$) $= A_{n\{1,\ldots,n\}}$.

How many additions and multiplications does this method require? How much
temporary storage is needed?

11. [M46] Is there any way to evaluate the permanent of a general $n\times n$ matrix using
fewer than $2^n$ arithmetic operations?

12. [M50] What is the minimum number of multiplications required to form the
product of two $n\times n$ matrices? What is the smallest exponent $\beta$ such that $O(n^{\beta+\epsilon})$
multiplications are sufficient for all $\epsilon > 0$?

13. [M23] Find the inverse of the general discrete Fourier transform (37), by
expressing $F(t_1, \ldots, t_n)$ in terms of the values of $f(s_1, \ldots, s_n)$. [Hint: See Eq. 1.2.9-13.]

► 14. [HM28] (“Fast Fourier transforms.”) Show that the scheme (40) can be used to
evaluate the one-dimensional discrete Fourier transform

$$f(s) = \sum_{0\le t<2^n} F(t)\,\omega^{st}, \qquad \omega = e^{2\pi i/2^n}, \qquad 0 \le s < 2^n,$$

using arithmetic on complex numbers. Estimate the number of arithmetic operations
performed.

► 15. [HM28] The $n$th divided difference $f(x_0, x_1, \ldots, x_n)$ of a function $f(x)$ at $n + 1$
distinct points $x_0$, $x_1$, ..., $x_n$ is defined by the formula

$$f(x_0, x_1, \ldots, x_n) = \bigl(f(x_0, x_1, \ldots, x_{n-1}) - f(x_1, \ldots, x_{n-1}, x_n)\bigr)/(x_0 - x_n),$$

for $n > 0$. Thus $f(x_0, x_1, \ldots, x_n) = \sum_{0\le k\le n} f(x_k)\big/\prod_{0\le j\le n,\ j\ne k}(x_k - x_j)$ is a
symmetric function of its $n + 1$ arguments. (a) Prove that $f(x_0, \ldots, x_n) = f^{(n)}(\theta)/n!$,
for some $\theta$ between $\min(x_0, \ldots, x_n)$ and $\max(x_0, \ldots, x_n)$, if the $n$th derivative $f^{(n)}(x)$
exists and is continuous. [Hint: Prove the identity

$$f(x_0, x_1, \ldots, x_n) = \int_0^1 dt_1 \int_0^{t_1} dt_2 \ldots \int_0^{t_{n-1}} dt_n\,
f^{(n)}\bigl(x_0(1 - t_1) + x_1(t_1 - t_2) + \cdots + x_{n-1}(t_{n-1} - t_n) + x_n(t_n - 0)\bigr).$$

This formula also defines $f(x_0, x_1, \ldots, x_n)$ in a useful manner when the $x_j$ are not
distinct.] (b) If $y_j = f(x_j)$, show that $\alpha_j = f(x_0, \ldots, x_j)$ in Newton’s interpolation
polynomial (42).

16. [M22] How can we readily compute the coefficients of $u_{[n]}(x) = u_nx^n + \cdots + u_0$,
if we are given the values of $x_0$, $x_1$, ..., $x_{n-1}$, $\alpha_0$, $\alpha_1$, ..., $\alpha_n$ in Newton’s interpolation
polynomial (42)?

17. [M45] Is there a way to evaluate the polynomial

$$\sum_{1\le i<j\le n} x_ix_j = x_1x_2 + \cdots + x_{n-1}x_n$$

with fewer than $n - 1$ multiplications and $2n - 4$ additions? (There are $\binom{n}{2}$ terms.)

18. [M20] If the fourth-degree scheme (9) were changed to

$$y = (x + \alpha_0)x + \alpha_1, \qquad u(x) = \bigl((y - x + \alpha_2)y + \alpha_3\bigr)\alpha_4,$$

what formulas for computing the $\alpha$’s in terms of the $u_k$’s would take the place of (10)?

► 19. [M24] Explain how to determine the adapted coefficients $\alpha_0$, $\alpha_1$, ..., $\alpha_5$ in
(11) from the coefficients $u_5$, ..., $u_1$, $u_0$ of $u(x)$, and find the $\alpha$’s for the particular
polynomial $u(x) = x^5 + 5x^4 - 10x^3 - 50x^2 + 13x + 60$.

► 20. [21] Write a MIX program that evaluates a fifth-degree polynomial according
to scheme (11); try to make the program as efficient as possible, by making slight
modifications to (11). Use MIX’s floating point arithmetic operators FADD and FMUL,
which are described in Section 4.2.1.

21. [20] Find two additional ways to evaluate the polynomial $x^6 + 13x^5 + 49x^4 +
33x^3 - 61x^2 - 37x + 3$ by scheme (12), using the two roots of (15) that were not
considered in the text.

22. [18] What is the scheme for evaluating $x^6 - 3x^5 + x^4 - 2x^3 + x^2 - 3x - 1$, using
Pan’s method (16)?

23. [HM30] (J. Eve.) Let $f(z) = a_nz^n + a_{n-1}z^{n-1} + \cdots + a_0$ be a polynomial of
degree $n$ with real coefficients, having at least $n - 1$ roots with a nonnegative real part.
Let

$$\begin{aligned}
g(z) &= a_nz^n + a_{n-2}z^{n-2} + \cdots + a_{n \bmod 2}\,z^{n \bmod 2},\\
h(z) &= a_{n-1}z^{n-1} + a_{n-3}z^{n-3} + \cdots + a_{(n-1) \bmod 2}\,z^{(n-1) \bmod 2}.
\end{aligned}$$

Assume that $h(z)$ is not identically zero.

a) Show that $g(z)$ has at least $n - 2$ imaginary roots (i.e., roots whose real part is
zero), and $h(z)$ has at least $n - 3$ imaginary roots. [Hint: Consider the number of
times the path $f(z)$ circles the origin as $z$ goes around the path shown in Fig. 15,
for a sufficiently large radius $R$.]

b) Prove that the squares of the roots of $g(z) = 0$ and $h(z) = 0$ are all real.

Fig. 15. Proof of Eve’s theorem.

► 24. [M24] Find values of c and otk, Pk satisfying the conditions of Theorem E, for 
the polynomial u(x) = (x -f 7)(x 2 -j- 6x + 10)(a; 2 -f 4x + 5)(z + 1). Choose these values 
so that fa — 0. Give two different solutions to this problem! 

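25. [M20] When the construction in the proof of Theorem M is applied to the (inefficient)
polynomial chain

$$\lambda_1 = \alpha_1 + \lambda_0, \quad \lambda_2 = -\lambda_0 - \lambda_0, \quad \lambda_3 = \lambda_1 + \lambda_1, \quad \lambda_4 = \alpha_2 \times \lambda_3,$$

$$\lambda_5 = \lambda_0 - \lambda_0, \quad \lambda_6 = \alpha_6 - \lambda_5, \quad \lambda_7 = \alpha_7 \times \lambda_6, \quad \lambda_8 = \lambda_7 \times \lambda_7,$$

$$\lambda_9 = \lambda_1 \times \lambda_4, \quad \lambda_{10} = \alpha_8 - \lambda_9, \quad \lambda_{11} = \lambda_3 - \lambda_{10},$$

how can $\beta_1$, $\beta_2$, ..., $\beta_9$ be expressed in terms of $\alpha_1$, ..., $\alpha_8$?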

► 26. [M21] (a) Give the polynomial chain corresponding to Horner’s rule for evaluating
polynomials of degree $n = 3$. (b) Using the construction that appears in the text’s
proof of Theorem A, express $\kappa_1$, $\kappa_2$, $\kappa_3$, and the result polynomial $u(x)$ in terms of $\beta_1$,
$\beta_2$, $\beta_3$, $\beta_4$, and $x$. (c) Show that the result set obtained in (b), as $\beta_1$, $\beta_2$, $\beta_3$, and $\beta_4$
independently assume all real values, omits certain vectors in the result set of (a).

27. [M22] Let $R$ be a set that includes all $(n+1)$-tuples $(q_n, \ldots, q_1, q_0)$ of real numbers
such that $q_n \ne 0$; prove that $R$ does not have at most $n$ degrees of freedom.

28. [HM20] Show that if $f_0(\alpha_1, \ldots, \alpha_s)$, ..., $f_s(\alpha_1, \ldots, \alpha_s)$ are multivariate
polynomials with integer coefficients, then there is a nonzero polynomial $g(x_0, \ldots, x_s)$ with
integer coefficients such that $g(f_0(\alpha_1, \ldots, \alpha_s), \ldots, f_s(\alpha_1, \ldots, \alpha_s)) = 0$ for all real $\alpha_1$,
..., $\alpha_s$. (Hence any polynomial chain with $s$ parameters has at most $s$ degrees of
freedom.) [Hint: Use the theorems about “algebraic dependence” that are found, for
example, in B. L. van der Waerden’s Modern Algebra, tr. by Fred Blum (New York:
Ungar, 1949), Section 64.]

► 29. [M20] Let $R_1$, $R_2$, ..., $R_m$ all be sets of $(n+1)$-tuples of real numbers having at
most $t$ degrees of freedom. Show that the union $R_1 \cup R_2 \cup \cdots \cup R_m$ also has at most
$t$ degrees of freedom.

► 30. [M28] Prove that a polynomial chain with $m_c$ chain multiplications and $m_p$
parameter multiplications has at most $2m_c + m_p + \delta_{0m_c}$ degrees of freedom. [Hint:
Generalize Theorem M, showing that the first chain multiplication and each parameter
multiplication can essentially introduce only one new parameter into the result set.]

31. [M23] Prove that a polynomial chain capable of computing all monic polynomials
of degree $n$ has at least $\lfloor n/2 \rfloor$ multiplications and at least $n$ addition-subtractions.

32. [M24] Find a polynomial chain of minimum possible length that can compute all
polynomials of the form $u_4 x^4 + u_2 x^2 + u_0$; and prove that its length is minimal.

► 33. [M25] Let $n \ge 3$ be odd. Prove that a polynomial chain with $\lfloor n/2 \rfloor + 1$
multiplication steps cannot compute all polynomials of degree $n$ unless it has at least
$n + 2$ addition-subtraction steps. [Hint: See exercise 30.]

34. [M26] Let $\lambda_0$, $\lambda_1$, ..., $\lambda_r$ be a polynomial chain in which all of the addition
and subtraction steps are parameter steps, and in which there is at least one parameter
multiplication. Assume that this scheme has $m$ multiplications and $k = r - m$
addition-subtractions, and that the polynomial computed by the chain has maximum
degree $n$. Prove that all polynomials computable by this chain, for which the coefficient
of $x^n$ is not zero, can be computed by another chain that has at most $m$ multiplications
and at most $k$ additions, and no subtractions; and whose last step is the only parameter
multiplication.

► 35. [ M25 ] Show that any polynomial chain that computes a general fourth-degree 
polynomial using three multiplications must have at least five addition-subtractions. 
[Hint: Assume that there are only four addition-subtractions, and show that exercise 34 
applies; this means the scheme must have a particular form that is incapable of 
representing all fourth-degree polynomials.] 

36. [M27] Show that any polynomial chain that computes a general sixth-degree polynomial
using only four multiplications must have at least seven addition-subtractions.
(Cf. exercise 35.)

37. [M21] (T. S. Motzkin.) Show that “almost all” rational functions of the form

$$(u_n x^n + u_{n-1} x^{n-1} + \cdots + u_1 x + u_0)/(x^n + v_{n-1} x^{n-1} + \cdots + v_1 x + v_0),$$

with coefficients in a field $S$, can be evaluated using the scheme

$$\alpha_1 + \beta_1/(x + \alpha_2 + \beta_2/(x + \cdots + \beta_n/(x + \alpha_{n+1}) \cdots)),$$

for suitable $\alpha_j$, $\beta_j$ in $S$. (This continued fraction scheme has $n$ divisions and $2n$
additions; by “almost all” rational functions we mean all except those whose coefficients
satisfy some nontrivial polynomial equation.) Determine the $\alpha$’s and $\beta$’s for the
rational function $(x^2 + 10x + 29)/(x^2 + 8x + 19)$.

► 38. [HM32] (V. Ia. Pan, 1962.) The purpose of this exercise is to prove that Horner’s
rule is really optimal if no preliminary adaptation of coefficients is made; we need $n$
multiplications and $n$ additions to compute $u_n x^n + \cdots + u_1 x + u_0$, if the variables $u_n$,
..., $u_1$, $u_0$, $x$, and arbitrary constants are given. Consider chains that are as before
except that $u_n$, ..., $u_1$, $u_0$, $x$ are each considered to be variables; we may say, for
example, that $\lambda_{-j-1} = u_j$, $\lambda_0 = x$. In order to show that Horner’s rule is best, it
is convenient to prove a somewhat more general theorem: Let $A = (a_{ij})$, $0 \le i \le m$,
$0 \le j \le n$, be an $(m + 1) \times (n + 1)$ matrix of real numbers, of rank $n + 1$; and let
$B = (b_0, \ldots, b_m)$ be a vector of real numbers. Prove that any polynomial chain that
computes

$$P(x; u_0, \ldots, u_n) = \sum_{0\le i\le m} (a_{i0} u_0 + \cdots + a_{in} u_n + b_i)\, x^i$$

involves at least $n$ chain multiplications. (Note that this does not mean only that
we are considering some fixed chain in which the parameters $\alpha_j$ are assigned values
depending on $A$ and $B$; it means that both the chain and the values of the $\alpha$’s may
depend on the given matrix $A$ and vector $B$. No matter how $A$, $B$, and the values
of $\alpha_j$ are chosen, it is impossible to compute $P(x; u_0, \ldots, u_n)$ without doing $n$
“chain-step” multiplications.) The assumption that $A$ has rank $n + 1$ implies that $m \ge n$.
[Hint: Show that from any such scheme we can derive another that has fewer chain
multiplications and that has $n$ decreased by one.]

39. [M29] (T. S. Motzkin, 1954.) Show that schemes of the form $w_1 = x(x + \alpha_1) + \beta_1$,
$w_k = w_{k-1}(w_1 + \gamma_k x + \alpha_k) + \delta_k x + \beta_k$ for $1 < k \le m$, where the $\alpha_k$, $\beta_k$ are real
and the $\gamma_k$, $\delta_k$ are integers, can be used to evaluate all monic polynomials of degree
$2m$ over the real numbers. (We may have to choose $\alpha_k$, $\beta_k$, $\gamma_k$, and $\delta_k$ differently for
different polynomials.) Try to let $\delta_k = 0$ whenever possible.

40. [M41] Can the lower bound in the number of multiplications in Theorem C be
raised from $\lfloor n/2 \rfloor + 1$ to $\lceil n/2 \rceil + 1$? (Cf. exercise 33.)

41. [22] Show that the real and imaginary parts of $(a + bi)(c + di)$ can be obtained
by doing 3 multiplications and 5 additions of real numbers, where two of the additions 
involve a and b only. 

42. [36] (M. Paterson and L. Stockmeyer.) (a) Prove that a polynomial chain with
$m \ge 2$ chain multiplications has at most $m^2 + 1$ degrees of freedom. (b) Show that for
all $n \ge 2$ there exist polynomials of degree $n$, all of whose coefficients are 0 or 1, that
cannot be evaluated by any polynomial chain with fewer than $\lfloor \sqrt{n} \rfloor$ multiplications, if
we require all parameters $\alpha_j$ to be integers. (c) Show that any polynomial of degree $n$
with integer coefficients can be evaluated by an all-integer algorithm that performs at
most $2\lfloor \sqrt{n} \rfloor$ multiplications, if we don’t care how many additions we do.

43. [22] Explain how to evaluate $x^n + \cdots + x + 1$ with $2l(n + 1) - 2$ multiplications
and $l(n + 1)$ additions (no divisions or subtractions), where $l(n)$ is the function studied
in Section 4.6.3.

► 44. [HM22] Let $(t_{ijk})$ be an $m \times n \times s$ tensor, and let $F$, $G$, $H$ be nonsingular matrices
of respective sizes $m \times m$, $n \times n$, $s \times s$. If

$$T_{ijk} = \sum_{1\le p\le m} \sum_{1\le q\le n} \sum_{1\le r\le s} F_{ip}\, G_{jq}\, H_{kr}\, t_{pqr}$$

for all $i$, $j$, $k$, prove that the tensor $(T_{ijk})$ has the same rank as $(t_{ijk})$. [Hint: Consider
what happens when $F^{-1}$, $G^{-1}$, $H^{-1}$ are applied in the same way to $(T_{ijk})$.]

45. [M28] Prove that all pairs $(z_1, z_2)$ of bilinear forms in $(x_1, x_2)$ and $(y_1, y_2)$ can be
evaluated with at most three chain multiplications. In other words, show that every
$2 \times 2 \times 2$ tensor has rank $\le 3$.

46. [M25] Prove that for all $m$, $n$, and $s$ there exists an $m \times n \times s$ tensor whose rank
is at least $\lceil mns/(m + n + s) \rceil$. Conversely, show that every $m \times n \times s$ tensor has rank
at most $mns/\max(m, n, s)$.

47. [M48] Is it possible to determine the rank of any given tensor $(t_{ijk})$ over, say, the
field of rational numbers, in a finite number of steps? (There is a finite way to compute 
the tensor rank over algebraically closed fields like the complex numbers, since this is a 
special case of the results of Alfred Tarski, A Decision Method for Elementary Algebra 
and Geometry, 2nd ed. (Berkeley, California: Univ. of California Press, 1951); but the 
known algorithms do not make this computation really feasible except for very small 
tensors. Over the field of rational numbers, the problem isn’t even known to be solvable 
in finite time.) 

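48. [M49] If $(t_{ijk})$ and $(t'_{ijk})$ are tensors of sizes $m \times n \times s$ and $m' \times n' \times s'$,
respectively, their direct sum $(t_{ijk}) \oplus (t'_{ijk}) = (t''_{ijk})$ is the $(m + m') \times (n + n') \times (s + s')$
tensor defined by $t''_{ijk} = t_{ijk}$ if $i \le m$, $j \le n$, $k \le s$; $t''_{ijk} = t'_{i-m,\,j-n,\,k-s}$ if $i > m$,
$j > n$, $k > s$; and $t''_{ijk} = 0$ otherwise. Their direct product $(t_{ijk}) \otimes (t'_{ijk}) = (t'''_{ijk})$
is the $mm' \times nn' \times ss'$ tensor defined by
$t'''_{\langle ii'\rangle\langle jj'\rangle\langle kk'\rangle} = t_{ijk}\, t'_{i'j'k'}$. Derive the upper
bounds $\mathrm{rank}(t''_{ijk}) \le \mathrm{rank}(t_{ijk}) + \mathrm{rank}(t'_{ijk})$ and
$\mathrm{rank}(t'''_{ijk}) \le \mathrm{rank}(t_{ijk}) \cdot \mathrm{rank}(t'_{ijk})$.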

► 49. [HM25] Show that the rank of an $m \times n \times 1$ tensor $(t_{ijk})$ is the same as its rank
as an $m \times n$ matrix $(t_{ij1})$, according to the traditional definition of matrix rank as the
maximum number of linearly independent rows.

50. [HM20] (S. Winograd.) Let $(t_{ijk})$ be the $mn \times n \times m$ tensor corresponding to
multiplication of an $m \times n$ matrix by an $n \times 1$ column vector. Prove that the rank of
$(t_{ijk})$ is $mn$.

► 51. [M24] (S. Winograd.) Devise an algorithm for cyclic convolution of degree 2 that
uses 2 multiplications and 4 additions, not counting operations on the $x_i$. Similarly,
devise an algorithm for degree 3, using 4 multiplications and 11 additions. (Cf. (67),
which solves the analogous problem for degree 4.)

52. [M25] (S. Winograd.) Let $n = n'n''$ where $\gcd(n', n'') = 1$. Given normal
schemes for cyclic convolutions of degrees $n'$ and $n''$, using respectively $(m', m'')$ chain
multiplications, $(p', p'')$ parameter multiplications, and $(a', a'')$ additions, show how
to construct a normal scheme for cyclic convolution of degree $n$ using $m'm''$ chain
multiplications, $p'n'' + m'p''$ parameter multiplications, and $a'n'' + m'a''$ additions.

53. [HM40] (S. Winograd.) Let $\omega$ be a complex $m$th root of unity, and consider the
one-dimensional discrete Fourier transform

$$f(s) = \sum_{1\le t\le m} F(t)\, \omega^{st}, \qquad \text{for } 1 \le s \le m.$$

(a) When $m = p^e$ is a power of an odd prime, show that efficient normal schemes for
computing cyclic convolutions of degrees $(p - 1)p^k$, for $0 \le k < e$, will lead to efficient
algorithms for computing the Fourier transform on $m$ complex numbers. Give a similar
construction for the case $p = 2$. (b) When $m = m'm''$ and $\gcd(m', m'') = 1$, show that
Fourier transformation algorithms for $m'$ and $m''$ can be combined to yield a Fourier
transformation algorithm for $m$ elements.

54. [M23] Theorem W refers to an infinite field. How many elements must a finite
field have in order for the proof of Theorem W to be valid?

55. [HM22] Determine the rank of tensor (72) when P is an arbitrary n X n matrix. 

56. [M32] (V. Strassen.) Show that any polynomial chain that evaluates a set of
quadratic forms $\sum_{1\le i\le n} \sum_{1\le j\le n} \tau_{ijk}\, x_i x_j$ for $1 \le k \le s$ must use at least
$\frac12 \mathrm{rank}(\tau_{ijk} + \tau_{jik})$ chain multiplications. [Hint: Show that the minimum number
of chain multiplications is the minimum rank of $(t_{ijk})$ taken over all tensors $(t_{ijk})$ such
that $t_{ijk} + t_{jik} = \tau_{ijk} + \tau_{jik}$ for all $i$, $j$, $k$.] Use this to prove that any polynomial chain
that evaluates a set of bilinear forms (45) corresponding to a tensor $(t_{ijk})$, whether normal
or abnormal, must use at least $\frac12 \mathrm{rank}(t_{ijk})$ chain multiplications.

57. [M20] Show that fast Fourier transforms can be used to compute the coefficients
of the product $x(u)y(u)$ of two given polynomials of degree $n$, using $O(n \log n)$ operations
of (exact) addition and multiplication of complex numbers. [Hint: Consider the product 
of Fourier transforms of the coefficients.] 

58. [ HM28 ] (a) Show that any realization A, B, C of the polynomial multiplication 
tensor (53) must have the following property: Any nonzero linear combination of the 
three rows of A must be a vector with at least four nonzero elements; and any nonzero 
linear combination of the four rows of B must have at least three nonzero elements, 
(b) Find a realization A, B, C of (53) using only 0, +1, and —1 as elements, where 
t = 8 . Try to use as many 0’s as possible. 

► 59. [M40] (H. J. Nussbaumer, 1980.) The text defines the cyclic convolution of two
sequences $(x_0, x_1, \ldots, x_{n-1})$ and $(y_0, y_1, \ldots, y_{n-1})$ to be the sequence $(z_0, z_1, \ldots, z_{n-1})$
where $z_k = x_0 y_k + \cdots + x_k y_0 + x_{k+1} y_{n-1} + \cdots + x_{n-1} y_{k+1}$. Let us define the
negacyclic convolution similarly, but with

$$z_k = x_0 y_k + \cdots + x_k y_0 - (x_{k+1} y_{n-1} + \cdots + x_{n-1} y_{k+1}).$$

Construct efficient algorithms for cyclic and negacyclic convolution over the integers
when $n$ is a power of 2. Your algorithms should deal entirely with integers, and
they should perform at most $O(n \log n)$ multiplications and at most $O(n \log n \log\log n)$
additions or subtractions or divisions of even numbers by 2.

60. [M27] (V. Ia. Pan.) The problem of $(m \times n)$ times $(n \times s)$ matrix multiplication
corresponds to an $mn \times ns \times sm$ tensor $(t_{\langle i,j'\rangle\langle j,k'\rangle\langle k,i'\rangle})$ where
$t_{\langle i,j'\rangle\langle j,k'\rangle\langle k,i'\rangle} = 1$ if
and only if $i' = i$ and $j' = j$ and $k' = k$. The rank of this tensor $T(m, n, s)$ is the
smallest $r$ such that numbers $a_{ij'l}$, $b_{jk'l}$, $c_{ki'l}$ exist satisfying

$$\sum_{\substack{1\le i\le m\\ 1\le j\le n\\ 1\le k\le s}} x_{ij}\, y_{jk}\, z_{ki}
= \sum_{1\le l\le r}
\Bigl(\sum_{\substack{1\le i\le m\\ 1\le j'\le n}} a_{ij'l}\, x_{ij'}\Bigr)
\Bigl(\sum_{\substack{1\le j\le n\\ 1\le k'\le s}} b_{jk'l}\, y_{jk'}\Bigr)
\Bigl(\sum_{\substack{1\le k\le s\\ 1\le i'\le m}} c_{ki'l}\, z_{ki'}\Bigr).$$

Let $M(n)$ be the rank of $T(n, n, n)$. The purpose of this exercise is to exploit the
symmetry of such a trilinear representation, obtaining efficient realizations of matrix
multiplication over the integers when $m = n = s = 2\nu$. For convenience we divide
the indices $\{1, \ldots, n\}$ into two subsets $O = \{1, 3, \ldots, n - 1\}$ and $E = \{2, 4, \ldots, n\}$ of
$\nu$ elements each, and we set up a one-to-one correspondence between $O$ and $E$ by the
rule $\tilde\imath = i + 1$ if $i \in O$; $\tilde\imath = i - 1$ if $i \in E$. Thus we have $\tilde{\tilde\imath} = i$ for all indices $i$.

a) The first construction is based on the identity

$$abc + ABC = (a + A)(b + B)(c + C) - (a + A)bC - A(b + B)c - aB(c + C).$$

It follows that

$$\sum_{1\le i,j,k\le n} x_{ij}\, y_{jk}\, z_{ki} = \sum_{(i,j,k)\in S} (x_{ij} + x_{\tilde k\tilde\imath})(y_{jk} + y_{\tilde\imath\tilde\jmath})(z_{ki} + z_{\tilde\jmath\tilde k}) - \Sigma_1 - \Sigma_2 - \Sigma_3,$$

where $S = E{\times}E{\times}E \cup E{\times}E{\times}O \cup E{\times}O{\times}E \cup O{\times}E{\times}E$ is the set of all triples
of indices containing at most one odd index; $\Sigma_1$ is the sum of all terms of the
form $(x_{ij} + x_{\tilde k\tilde\imath})\, y_{jk}\, z_{\tilde\jmath\tilde k}$ for $(i, j, k) \in S$; and $\Sigma_2$, $\Sigma_3$ similarly are sums of the terms
$x_{\tilde k\tilde\imath}(y_{jk} + y_{\tilde\imath\tilde\jmath})\, z_{ki}$, $x_{ij}\, y_{\tilde\imath\tilde\jmath}(z_{ki} + z_{\tilde\jmath\tilde k})$. Clearly $S$ has $4\nu^3 = \frac12 n^3$ terms. Show that
each of $\Sigma_1$, $\Sigma_2$, $\Sigma_3$ can be realized as the sum of $3\nu^2$ trilinear terms; furthermore,
if the $3\nu$ triples of the forms $(i, i, \tilde\imath)$ and $(i, \tilde\imath, i)$ and $(\tilde\imath, i, i)$ are removed from $S$, we
can modify $\Sigma_1$, $\Sigma_2$, and $\Sigma_3$ in such a way that the identity is still valid, without
adding any new trilinear terms. Thus $M(n) \le \frac12 n^3 - \frac32 n + \frac94 n^2$ when $n$ is even.

b) Apply the method of (a) to show that two independent matrix multiplication
problems of size $m \times n \times s$ can be performed with $mns + mn + ns + sm$
noncommutative multiplications.

c) The second construction is based on the identity 

abc + ABC + ABC = {a + A+ A){b + B + B){c + C + C) 

- (aB(c + C + C) + Ab{c + C + C) + AB(c + C + Q) 

- (a{b + B)C + A(B + B)C + A(3 + b)c) 

— ((& + A)bC 4- (A 4~ a) Be 4” {A 4" A)BC'j 

— (aBC 4- ABc 4- AbC). 

Show that 

V. XijyjkZki= ^2 A:;e, f,??) —El — E 2 — E 3 ; 

l<t ,j,k<n (i,j,k)GS 

0 < e, < 1 

here ]) = ((—l) ? +^x t + e)J+? 44—l) e +% + ^ fc+£ +(—l)^+ e x fc+ ^ +77 ). 

((-ir +£ y J+f , fc+r? + (-l) ?+7? ^ + ^+,4-(-l) €+ ^ ) ; +£ )-((-l) e+ ^+,^+ e + 

(— l) r,+e Zi+ (l j+ v 4- (— Zj+t'k+s ) corresponds to the first term on the right- 
hand side of the above identity and Ei, E 2 , E 3 correspond respectively to the 
next three groups of terms; the remaining terms (namely those corresponding to 
aBC 4- ABc -f- AbC ) cancel out of the sum. The set S in this case is different from 
the S in part (a); it consists of all ( i,j, k) £ OxOxO such that i < j and i < k. 
It follows from this construction that M(n ) < |((f) 3 — (f)) + 6n 2 when n is 
even. 

61. [M23] Let $(t_{ijk})$ be a tensor over an arbitrary field. We define $\mathrm{rank}_d(t_{ijk})$ as the
minimum value of $r$ such that there is a realization of the form

$$\sum_{1\le l\le r} a_{il}(u)\, b_{jl}(u)\, c_{kl}(u) = t_{ijk}\, u^d + O(u^{d+1}),$$

where $a_{il}(u)$, $b_{jl}(u)$, $c_{kl}(u)$ are polynomials in $u$ over the field. Thus $\mathrm{rank}_0$ is the
ordinary rank of a tensor. Prove that (a) $\mathrm{rank}_{d+1}(t_{ijk}) \le \mathrm{rank}_d(t_{ijk})$;
(b) $\mathrm{rank}(t_{ijk}) \le \binom{d+2}{2} \mathrm{rank}_d(t_{ijk})$;
(c) $\mathrm{rank}_d((t_{ijk}) \oplus (t'_{ijk})) \le \mathrm{rank}_d(t_{ijk}) + \mathrm{rank}_d(t'_{ijk})$, in the sense of
exercise 48; (d) $\mathrm{rank}_{d+d'}((t_{ijk}) \otimes (t'_{ijk})) \le \mathrm{rank}_d(t_{ijk}) \cdot \mathrm{rank}_{d'}(t'_{ijk})$.

62. [M24] The border rank of $(t_{ijk})$, denoted by $\underline{\mathrm{rank}}(t_{ijk})$, is $\min_{d\ge 0} \mathrm{rank}_d(t_{ijk})$,
where rankd is defined in exercise 61. Prove that the tensor (q q) has rank 3 but 
border rank 2 , over every field. 

63. [M30] Let $T(m, n, s)$ be the tensor for matrix multiplication as in exercise 60.
(a) Show that $T(m, n, s) \otimes T(M, N, S) = T(mM, nN, sS)$. (b) If $T(m, n, s)$ has
rank $\le r$, show that the number of noncommutative multiplications $M(N)$ for the
product of $N \times N$ matrices is $O(N^{\beta(m,n,s,r)})$ as $N \to \infty$, where $\beta(m, n, s, r) =
3 \log r/\log mns$. (c) If $T(m, n, s)$ has border rank $\le r$, show that we have $M(N) =
O(N^{\beta(m,n,s,r)} (\log N)^2)$. (d) If the tensor $pT(m, n, s)$ obtained by taking the direct sum
of $p$ copies of $T(m, n, s)$ has border rank $\le pr$, where $p < r$, show that a similar
formula holds. (This tensor corresponds to $p$ independent matrix multiplications.)

64. [M40] (Pan and Winograd.) The purpose of this exercise is to combine the ideas
of exercises 60-63, in order to obtain asymptotic bounds on $M(N)$ as $N \to \infty$.

a) Modify the construction of exercise 60(b) to show that $T(m, n, s) \oplus T(s, m, n)$ has
border rank $\le mns + mn + ns$.

b) Show that the construction of exercise 60(c) can be altered to prove that the border
rank of $T(m, n, 2s) \oplus T(2n, s, m) \oplus T(s, 2m, n)$ is at most $2(m + 1)n(s + 2)$, by
finding constants $a_1$, $a_2$, ..., $a_6$ such that

U ^ ^ {XijUjk^ki 0 XjkYkiZij 0 Xfct 10-Zj'fc) 

= Y. + uix >« + <yu i+2 X k i){ou d - l y fk + au d Yki + 1 hj) 

v / / c£-f-"l 1/71 d — 2 \ 

x \ U Z ki 0 &Zij 0 (TU Zjk) 

+ a2 Y ( a ' ax >i + °u d+2 Ek ^i)(aifc + au d }2 k Y ki ) 

' ,1 '° X (ocaZij + u d+1 Y, h z~ki) 

+ "Y. ( crXi 3 + u<i Ek X J*){yv + (7U ' 1 1 Ek Vjk) 

'' 3 "’ X {aZr 3 + au d ~ 2 E t Z jk ) 

+ atY(° E, x 'j + u<! x'ik)(E. y-v + aud -'y,k) 

lX ° 

+ Y ( a Ei x » + « d E t x j *)(E, yv + ° ud ~ l Ek v&) 

x(<rE ^. 5 W^Ek^k) 

+ “s Y ( a Ei ^^(E, y-v)(c E. z a) 


(modulo $u^{2d+1}$), for sufficiently large $d$. Here the summations are to be carried
out for $1 \le i \le m$, $1 \le j \le n$, $1 \le k \le s$, and $\sigma = \pm 1$; the subscript $\tilde\imath$ means
$\sigma i$, so that $\tilde\imath$ takes the $2m$ values $\pm 1$, ..., $\pm m$.

c) Use the result of (b) in connection with exercise 63 to show that $M(N) = O(N^{\beta+\epsilon})$
for all $\epsilon > 0$, where $\beta = \beta(m, n, 2s, \frac23(m + 1)n(s + 2))$.

*4.7. MANIPULATION OF POWER SERIES

If we are given two power series

$$U(z) = U_0 + U_1 z + U_2 z^2 + \cdots, \qquad V(z) = V_0 + V_1 z + V_2 z^2 + \cdots, \tag{1}$$

whose coefficients belong to a field, we can form their sum, their product, their 
quotient, etc., to obtain new power series. A polynomial is obviously a special 
case of a power series, in which there are only finitely many terms. 

Of course, only a finite number of terms can be represented and stored 
within a computer, so it makes sense to ask whether power series arithmetic 
is even possible on computers; and if it is possible, what makes it different 
from polynomial arithmetic? The answer is that we work with only the first N 
coefficients of the power series, where N is a parameter that may in principle be 
arbitrarily large; instead of ordinary polynomial arithmetic, we are essentially 
doing polynomial arithmetic modulo z N , and this often leads to a somewhat 
different point of view. Furthermore, special operations like “reversion” can be 
performed on power series but not on polynomials, since polynomials are not 
closed under these operations. 

Manipulation of power series has several applications to numerical analysis, 
but perhaps its greatest use is the determination of asymptotic expansions (as we 
have seen in Section 1.2.11.3), or the calculation of quantities defined by certain 
generating functions. The latter applications make it desirable to calculate 
the coefficients exactly, instead of with floating point arithmetic. All of the 
algorithms in this section, with obvious exceptions, can be done using rational 
operations only, so the techniques of Section 4.5.1 can be used to obtain exact 
results when desired. 

The calculation of $W(z) = U(z) \pm V(z)$ is, of course, trivial, since we have
$W_n = U_n \pm V_n$ for $n = 0, 1, 2, \ldots$. It is also easy to calculate $W(z) = U(z)V(z)$,
using the familiar “Cauchy product rule”:

$$W_n = \sum_{0\le k\le n} U_k V_{n-k} = U_0 V_n + U_1 V_{n-1} + \cdots + U_n V_0. \tag{2}$$

The quotient $W(z) = U(z)/V(z)$, when $V_0 \ne 0$, can be obtained by interchanging
$U$ and $W$ in (2); we obtain the rule

$$W_n = \Bigl(U_n - \sum_{0\le k<n} W_k V_{n-k}\Bigr)\Big/ V_0
= (U_n - W_0 V_n - W_1 V_{n-1} - \cdots - W_{n-1} V_1)/V_0. \tag{3}$$

This recurrence relation for the $W$’s makes it easy to determine $W_0$, $W_1$, $W_2$, ...
successively, without inputting $U_n$ and $V_n$ until after $W_{n-1}$ has been computed.
Let us say that a power series manipulation algorithm with the latter property
is “on-line”; an on-line algorithm can be used to determine $N$ coefficients $W_0$,
$W_1$, ..., $W_{N-1}$ of the result without knowing $N$ in advance, so it is possible in
theory to run the algorithm indefinitely and compute the entire power series; or
to run it until a certain condition is met. (The opposite of “on-line” is “off-line.”)
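
Rules (2) and (3) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `series_mul` and `series_div` are ours, coefficients are held as exact rationals with the standard `fractions` module, and the quotient is written as a generator so that $W_n$ is produced as soon as $U_n$ and $V_n$ have been read, which is the on-line property described above.

```python
from fractions import Fraction

def series_mul(U, V):
    """First N coefficients of U(z)V(z) by the Cauchy product rule (2)."""
    N = min(len(U), len(V))
    return [sum(U[k] * V[n - k] for k in range(n + 1)) for n in range(N)]

def series_div(U, V):
    """Generate W_0, W_1, ... of U(z)/V(z) by recurrence (3); needs V_0 != 0.

    U and V may be any iterables; W_n is yielded right after U_n and V_n
    have been consumed, so the number of terms need not be known in advance.
    """
    u, v, w = [], [], []
    for un, vn in zip(U, V):
        u.append(Fraction(un))
        v.append(Fraction(vn))
        n = len(w)
        acc = u[n] - sum(w[k] * v[n - k] for k in range(n))
        w.append(acc / v[0])
        yield w[-1]
```

For instance, `list(series_div([1, 0, 0, 0, 0, 0], [1, -1, -1, 0, 0, 0]))` expands $1/(1 - z - z^2)$ and yields the Fibonacci numbers 1, 1, 2, 3, 5, 8.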

If the coefficients $U_k$ and $V_k$ are integers but the $W_k$ are not, the recurrence
relation (3) involves computation with fractions. This can be avoided by the
all-integer approach described in exercise 2.

Let us now consider the operation of computing $W(z) = V(z)^\alpha$, where $\alpha$ is
an “arbitrary” power. For example, we could calculate the square root of $V(z)$
by taking $\alpha = \frac12$, or we could find $V(z)^{-10}$ or even $V(z)^\pi$. If $V_m$ is the first
nonzero coefficient of $V(z)$, we have

$$V(z) = V_m z^m \bigl(1 + (V_{m+1}/V_m) z + (V_{m+2}/V_m) z^2 + \cdots\bigr),$$

$$V(z)^\alpha = V_m^\alpha z^{\alpha m} \bigl(1 + (V_{m+1}/V_m) z + (V_{m+2}/V_m) z^2 + \cdots\bigr)^\alpha. \tag{4}$$

This will be a power series if and only if $\alpha m$ is a nonnegative integer. From (4)
we can see that the problem of computing general powers can be reduced to the
case that $V_0 = 1$; then the problem is to find coefficients of

$$W(z) = (1 + V_1 z + V_2 z^2 + V_3 z^3 + \cdots)^\alpha. \tag{5}$$

Clearly $W_0 = 1^\alpha = 1$.

The obvious way to find the coefficients of (5) is to use the binomial theorem
(Eq. 1.2.9-19), or (if $\alpha$ is a positive integer) to try repeated squaring as in Section
4.6.3; but a much simpler and more efficient device for calculating powers has
been suggested by J. C. P. Miller. [See P. Henrici, JACM 3 (1956), 10-15.] If
$W(z) = V(z)^\alpha$, we have by differentiation

$$W_1 + 2W_2 z + 3W_3 z^2 + \cdots = W'(z) = \alpha V(z)^{\alpha-1} V'(z); \tag{6}$$

therefore

$$W'(z)V(z) = \alpha W(z)V'(z). \tag{7}$$

If we now equate the coefficients of $z^{n-1}$ in (7), we find that

$$\sum_{0\le k\le n} k W_k V_{n-k} = \alpha \sum_{0\le k\le n} (n - k) W_k V_{n-k}, \tag{8}$$

and this gives us a useful computational rule valid for all $n \ge 1$:

$$W_n = \sum_{1\le k\le n} \Bigl(\Bigl(\frac{\alpha + 1}{n}\Bigr)k - 1\Bigr) V_k W_{n-k}
= \bigl((\alpha + 1 - n)V_1 W_{n-1} + (2\alpha + 2 - n)V_2 W_{n-2} + \cdots + n\alpha V_n\bigr)/n. \tag{9}$$

This equation leads to a simple on-line algorithm by which we can successively
determine $W_1$, $W_2$, ..., using approximately $2n$ multiplications to compute the
$n$th coefficient. Note the special case $\alpha = -1$, in which (9) becomes the special
case $U(z) = V_0 = 1$ of (3).
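
A minimal Python sketch of rule (9) follows, assuming $V_0 = 1$ and a rational exponent so that the arithmetic stays exact. The name `series_power` and the convention that coefficients beyond the end of the input list are zero are our own choices for illustration.

```python
from fractions import Fraction

def series_power(V, alpha, N):
    """First N coefficients of (1 + V_1 z + V_2 z^2 + ...)^alpha via rule (9).

    V[0] must equal 1; alpha may be any rational number.
    """
    if V[0] != 1:
        raise ValueError("rule (9) assumes V_0 = 1")
    alpha = Fraction(alpha)
    W = [Fraction(1)]                        # W_0 = 1^alpha = 1
    for n in range(1, N):
        total = Fraction(0)
        for k in range(1, n + 1):
            vk = V[k] if k < len(V) else 0   # missing coefficients count as 0
            total += ((alpha + 1) * k - n) * vk * W[n - k]
        W.append(total / n)
    return W
```

With $\alpha = 3$ the sketch reproduces the binomial coefficients of $(1 + z)^3$, with $\alpha = \frac12$ it gives the expansion of $\sqrt{1 + z}$, and with $\alpha = -1$ it reduces to the division recurrence (3), as noted above.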

A similar technique can be used to form $f(V(z))$ when $f$ is any function
that satisfies a simple differential equation. (For example, see exercise 4.) A
comparatively straightforward “power series method” is often used to obtain 
the solution of differential equations; this technique is explained in nearly all 
textbooks about differential equations. 
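
As one illustration of forming $f(V(z))$ from a differential equation, take $f = \exp$; this choice is ours and is not singled out by the text. If $W(z) = \exp V(z)$ with $V_0 = 0$, then $W'(z) = V'(z)W(z)$, and equating coefficients of $z^{n-1}$ gives $nW_n = \sum_{1\le k\le n} k V_k W_{n-k}$. A hedged Python sketch (the name `series_exp` is hypothetical):

```python
from fractions import Fraction

def series_exp(V, N):
    """First N coefficients of exp(V(z)), assuming V_0 = 0.

    Uses n*W_n = sum_{k=1..n} k*V_k*W_{n-k}, which follows from W' = V'W.
    """
    if V[0] != 0:
        raise ValueError("this sketch assumes V_0 = 0")
    W = [Fraction(1)]                         # exp(0) = 1
    for n in range(1, N):
        top = min(n, len(V) - 1)              # V_k = 0 beyond the input list
        total = sum((k * Fraction(V[k]) * W[n - k] for k in range(1, top + 1)),
                    Fraction(0))
        W.append(total / n)
    return W
```

Calling `series_exp([0, 1], 5)` returns the coefficients $1, 1, \frac12, \frac16, \frac1{24}$ of $e^z$.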

Fig. 16. Power series reversion by Algorithm L.


Reversion of series. The transformation of power series that is perhaps of greatest
interest is called “reversion of series.” This problem is to solve the equation

$$z = t + V_2 t^2 + V_3 t^3 + V_4 t^4 + \cdots \tag{10}$$

for $t$, obtaining the coefficients of the power series

$$t = z + W_2 z^2 + W_3 z^3 + W_4 z^4 + \cdots. \tag{11}$$

Several interesting ways to achieve such a reversion are known. We might
say that the “classical” method is one based on Lagrange’s remarkable inversion
formula [Memoires Acad. Royale des Sciences et Belles-Lettres de Berlin 24
(1768), 251-326], which states that

$$W_n = U_{n-1}/n, \qquad \text{if} \qquad U_0 + U_1 t + U_2 t^2 + \cdots = (1 + V_2 t + V_3 t^2 + \cdots)^{-n}. \tag{12}$$

For example, we have $(1 - t)^{-5} = \binom44 + \binom54 t + \binom64 t^2 + \cdots$; hence $W_5$ in the
reversion of $z = t - t^2$ is equal to $\binom84/5 = 14$. This checks with the formulas
for enumerating binary trees in Section 2.3.4.4.

Relation (12) shows that we can revert the series (10) if we compute the
negative powers $(1 + V_2 t + V_3 t^2 + \cdots)^{-n}$ for $n = 1, 2, 3, \ldots$. A straightforward
application of this idea would lead to an on-line reversion algorithm that uses
approximately $N^3/2$ multiplications to find $N$ coefficients, but Eq. (9) makes it
possible to work with only the first $n$ coefficients of $(1 + V_2 t + V_3 t^2 + \cdots)^{-n}$,
obtaining an on-line algorithm that requires only about $N^3/6$ multiplications.

Algorithm L (Lagrangian power series reversion). This on-line algorithm inputs
the value of $V_n$ in (10) and outputs the value of $W_n$ in (11), for $n = 2, 3, 4,
\ldots, N$. (The number $N$ need not be specified in advance; some other termination
criterion may be substituted.)

L1. [Initialize.] Set $n \leftarrow 1$, $U_0 \leftarrow 1$. (The relation

$$(1 + V_2 t + V_3 t^2 + \cdots)^{-n} = U_0 + U_1 t + \cdots + U_{n-1} t^{n-1} + O(t^n) \tag{13}$$

will be maintained throughout this algorithm.)

L2. [Input $V_n$.] Increase $n$ by 1. If $n > N$, the algorithm terminates; otherwise
input the next coefficient, $V_n$.

L3. [Divide.] Set $U_k \leftarrow U_k - U_{k-1} V_2 - \cdots - U_1 V_k - V_{k+1}$, for $k = 1, 2, \ldots,
n - 2$ (in this order); then set

$$U_{n-1} \leftarrow -2U_{n-2} V_2 - 3U_{n-3} V_3 - \cdots - (n - 1) U_1 V_{n-1} - nV_n.$$

(We have thereby divided $U(z)$ by $V(z)/z$; cf. (3) and (9).)

L4. [Output $W_n$.] Output $U_{n-1}/n$ (which is $W_n$) and return to L2. |


When applied to the example $z = t - t^2$, Algorithm L computes

     n    V_n    U_0    U_1    U_2    U_3    U_4    W_n
     1      1      1                                  1
     2     -1      1      2                           1
     3      0      1      3      6                    2
     4      0      1      4     10     20             5
     5      0      1      5     15     35     70     14
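
Steps L1 through L4 can be sketched in Python as a generator, which matches the on-line character of the algorithm. The name `revert_lagrange` and the use of exact `Fraction` arithmetic are our assumptions; the updates themselves follow steps L3 and L4 as stated.

```python
from fractions import Fraction

def revert_lagrange(V_inputs):
    """Yield W_2, W_3, ... for z = t + V_2 t^2 + V_3 t^3 + ... (Algorithm L).

    V_inputs is any iterable that supplies V_2, V_3, ... one at a time.
    """
    V = [None, Fraction(1)]          # V[1] = 1; V[0] is unused padding
    U = [Fraction(1)]                # step L1: U_0 = 1, n = 1
    n = 1
    for v in V_inputs:               # step L2: read the next V_n
        n += 1
        V.append(Fraction(v))
        # Step L3: divide by 1 + V_2 t + V_3 t^2 + ..., for k = 1, ..., n-2
        # in increasing order (U[k - j] already holds its updated value).
        for k in range(1, n - 1):
            U[k] -= sum(U[k - j] * V[j + 1] for j in range(1, k + 1))
        # New coefficient U_{n-1} = -2 U_{n-2} V_2 - ... - n V_n.
        U.append(-sum((j + 1) * U[n - 1 - j] * V[j + 1] for j in range(1, n)))
        yield U[n - 1] / n           # step L4: this is W_n
```

Feeding it $V_2 = -1$, $V_3 = V_4 = \cdots = 0$ reproduces the $W_n$ column of the table above: `list(revert_lagrange([-1, 0, 0, 0]))` gives 1, 2, 5, 14.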


Exercise 8 shows that a slight modification of Algorithm L will solve a considerably
more general problem with only a little more effort.

Let us now consider solving the equation

$$U_1 z + U_2 z^2 + U_3 z^3 + \cdots = t + V_2 t^2 + V_3 t^3 + \cdots \tag{14}$$

for $t$, obtaining the coefficients of the power series

$$t = W_1 z + W_2 z^2 + W_3 z^3 + W_4 z^4 + \cdots. \tag{15}$$

Eq. (10) is the special case $U_1 = 1$, $U_2 = U_3 = \cdots = 0$. If $U_1 \ne 0$, we may
assume that $U_1 = 1$, if we replace $z$ by $(U_1 z)$; but we shall consider the general
equation (14), since $U_1$ might equal zero.

Algorithm T (General power series reversion). This on-line algorithm inputs the
values of $U_n$ and $V_n$ in (14) and outputs the value of $W_n$ in (15), for $n = 1, 2, 3,
\ldots, N$. An auxiliary matrix $T_{mn}$, $1 \le m \le n \le N$, is used in the calculations.

T1. [Initialize.] Set $n \leftarrow 1$. Let the first two inputs (namely, $U_1$ and $V_1$) be
stored in $T_{11}$ and $V_1$, respectively. (We must have $V_1 = 1$.)

T2. [Output $W_n$.] Output the value of $T_{1n}$ (which is $W_n$).

T3. [Input $U_n$, $V_n$.] Increase $n$ by 1. If $n > N$, the algorithm terminates;
otherwise store the next two inputs (namely, $U_n$ and $V_n$) in $T_{1n}$ and $V_n$.

T4. [Multiply.] Set

$$T_{mn} \leftarrow T_{11} T_{m-1,n-1} + T_{12} T_{m-1,n-2} + \cdots + T_{1,n-m+1} T_{m-1,m-1}$$

and $T_{1n} \leftarrow T_{1n} - V_m T_{mn}$, for $2 \le m \le n$. (After this step we have

$$t^m = T_{mm} z^m + T_{m,m+1} z^{m+1} + \cdots + T_{mn} z^n + O(z^{n+1}), \tag{16}$$

for $1 \le m \le n$. It is easy to verify (16) by induction for $m \ge 2$, and when
$m = 1$, we have $U_n = T_{1n} + V_2 T_{2n} + \cdots + V_n T_{nn}$ by (14) and (16).) Return
to step T2. |

Equation (16) explains the mechanism of this algorithm, which is due to
Henry C. Thacher, Jr. [CACM 9 (1966), 10-11]. The running time is essentially
the same as Algorithm L, but considerably more storage space is required. An
example of this algorithm is worked out in exercise 9.
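
A Python sketch of Algorithm T is given below, under a few assumptions of ours: the function name `revert_general`, list inputs `[U_1, ..., U_N]` and `[V_1, ..., V_N]` supplied all at once (rather than read on-line), and a full $(N+1) \times (N+1)$ array for $T$ with row and column 0 unused.

```python
from fractions import Fraction

def revert_general(U, V):
    """Return [W_1, ..., W_N] solving U(z) = V(t), with V_1 = 1 (Algorithm T).

    U and V are the lists [U_1, ..., U_N] and [V_1, ..., V_N].
    """
    N = len(U)
    if len(V) != N or V[0] != 1:
        raise ValueError("need equally many U's and V's, with V_1 = 1")
    u = [None] + [Fraction(x) for x in U]     # shift to 1-based indexing
    v = [None] + [Fraction(x) for x in V]
    T = [[Fraction(0)] * (N + 1) for _ in range(N + 1)]   # T[m][n]
    W = []
    for n in range(1, N + 1):
        T[1][n] = u[n]                         # steps T1 and T3
        for m in range(2, n + 1):              # step T4
            # Coefficient of z^n in t^m = t * t^(m-1), by (16).
            T[m][n] = sum(T[1][j] * T[m - 1][n - j]
                          for j in range(1, n - m + 2))
            T[1][n] -= v[m] * T[m][n]
        W.append(T[1][n])                      # step T2: W_n
    return W
```

For the earlier example $z = t - t^2$, `revert_general([1, 0, 0, 0, 0], [1, -1, 0, 0, 0])` returns 1, 1, 2, 5, 14. The sketch also handles $U_1 = 0$: solving $z^2 = t + t^2$ gives $t = z^2 - z^4 + \cdots$.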

Still another approach to power series reversion has been proposed by R. P.
Brent and H. T. Kung [JACM 25 (1978), 581-595], based on the fact that standard
iterative procedures used to find roots of equations over the real numbers
can also be applied to equations over power series. In particular, we can consider
Newton’s method for computing approximations to a real number $t$ such
that $f(t) = 0$, given a function $f$ that is well-behaved near $t$: If $x$ is a good
approximation to $t$, then $\phi(x) = x - f(x)/f'(x)$ will be even better, for if we
write $x = t + \epsilon$ we have $f(x) = f(t) + \epsilon f'(t) + O(\epsilon^2)$, $f'(x) = f'(t) + O(\epsilon)$;
consequently $\phi(x) = t + \epsilon - (0 + \epsilon f'(t) + O(\epsilon^2))/(f'(t) + O(\epsilon)) = t + O(\epsilon^2)$.
Applying this idea to power series, let $f(x) = V(x) - U(z)$, where $U$ and $V$ are
the power series in Eq. (14). We wish to find the power series $t$ in $z$ such that
$f(t) = 0$. Let $x = W_1 z + \cdots + W_{n-1} z^{n-1} = t + O(z^n)$ be an “approximation”
to $t$ of order $n$; then $\phi(x) = x - f(x)/f'(x)$ will be an approximation of order $2n$,
since the assumptions of Newton’s method hold for this $f$ and $t$.

In other words, we can use the following procedure: 

Algorithm N (General power series reversion by Newton’s method). This
“semi-on-line” algorithm inputs the values of $U_n$ and $V_n$ in (14) for $2^k \le n < 2^{k+1}$
and then outputs the values of $W_n$ in (15) for $2^k \le n < 2^{k+1}$, thereby producing
its answers in batches of $2^k$ at a time, for $k = 0, 1, 2, \ldots, K$.

N1. [Initialize.] Set $N \leftarrow 1$. (We will have $N = 2^k$.) Input the first coefficients
$U_1$ and $V_1$ (where $V_1 = 1$), and set $W_1 \leftarrow U_1$.

N2. [Output.] Output $W_n$ for $N \le n < 2N$.

N3. [Input.] Set $N \leftarrow 2N$. If $N > 2^K$, the algorithm terminates; otherwise
input the values $U_n$ and $V_n$ for $N \le n < 2N$.

N4. [Newtonian step.] Use an algorithm for power series composition (see exercise 11)
to evaluate the coefficients $Q_j$ and $R_j$ ($0 \le j < N$) in the power
series

$$U_1 z + \cdots + U_{2N-1} z^{2N-1} - V(W_1 z + \cdots + W_{N-1} z^{N-1})
= R_0 z^N + R_1 z^{N+1} + \cdots + R_{N-1} z^{2N-1} + O(z^{2N}),$$

$$V'(W_1 z + \cdots + W_{N-1} z^{N-1}) = Q_0 + Q_1 z + \cdots + Q_{N-1} z^{N-1} + O(z^N),$$

where $V(x) = x + V_2 x^2 + \cdots$ and $V'(x) = 1 + 2V_2 x + \cdots$. Then set
$W_N$, ..., $W_{2N-1}$ to the coefficients in the power series

$$\frac{R_0 + R_1 z + \cdots + R_{N-1} z^{N-1}}{Q_0 + Q_1 z + \cdots + Q_{N-1} z^{N-1}}
= W_N + \cdots + W_{2N-1} z^{N-1} + O(z^N)$$

and return to step N2. |

The running time for this algorithm to obtain the coefficients up to $N = 2^K$
is $T(N)$, where

$$T(2N) = T(N) + (\text{time to do step N4}) + O(N). \tag{17}$$

Straightforward algorithms for composition and division in step N4 will take
order $N^3$ steps, so Algorithm N will run slower than Algorithm T. However,
Brent and Kung have found a way to do the required composition of power
series with $O(N \log N)^{3/2}$ arithmetic operations, and exercise 6 gives an even
faster algorithm for division; hence (17) shows that power series reversion can
be achieved by doing only $O(N \log N)^{3/2}$ operations as $N \to \infty$. (On the other
hand the constant of proportionality is such that $N$ must be really large before
Algorithms L and T lose out to this “high-speed” method.)

Historical note: J. N. Bramhall and M. A. Chapple published the first $O(n^3)$
method for power series reversion in CACM 4 (1961), 317-318, 503. It was an
off-line algorithm whose running time was approximately the same as that of
Algorithms L and T.

Iteration of series. If we want to study the behavior of an iterative process
$x_n \leftarrow f(x_{n-1})$, we are interested in studying the $n$-fold composition of a given
function $f$ with itself, namely $x_n = f(f(\ldots f(x_0) \ldots))$. Let us define $f^{[0]}(x) = x$
and $f^{[n]}(x) = f(f^{[n-1]}(x))$, so that

$$f^{[m+n]}(x) = f^{[m]}(f^{[n]}(x)) \tag{18}$$

for all integers $m, n \ge 0$. It also makes sense to define $f^{[n]}(x)$ when $n$ is a
negative integer, namely to let $f^{[n]}$ and $f^{[-n]}$ be inverse functions such that
$x = f^{[n]}(f^{[-n]}(x))$; then (18) holds for all integers $m$ and $n$. Reversion of series is
essentially the operation of finding the inverse function $f^{[-1]}(x)$; for example,
Eqs. (10) and (11) essentially state that $z = V(W(z))$ and that $t = W(V(t))$, so
$W = V^{[-1]}$.

Suppose we are given two power series $V(z) = z + V_2 z^2 + \cdots$ and
$W(z) = z + W_2 z^2 + \cdots$ such that $W = V^{[-1]}$. Let $u$ be any nonzero constant, and
consider the function

$$U(z) = W(uV(z)). \tag{19}$$

It is easy to see that $U(U(z)) = W(u^2 V(z))$, and in general that

$$U^{[n]}(z) = W(u^n V(z)) \tag{20}$$

for all integers $n$. Therefore we have a simple expression for the $n$th iterate
$U^{[n]}$, which can be calculated with roughly the same amount of work for all $n$.
Furthermore, we can even use (20) to define $U^{[n]}$ for non-integer values of $n$; the
"half iterate" $U^{[1/2]}$, for example, is a function such that $U^{[1/2]}(U^{[1/2]}(z)) = U(z)$.
(There are two such functions $U^{[1/2]}$, obtained by using $\sqrt{u}$ and $-\sqrt{u}$ as the
value of $u^{1/2}$.)

We obtained the simple state of affairs in (20) by starting with $V$ and $u$,
then defining $U$. But in practice we generally want to go the other way: Starting
with some given function $U$, we want to find $V$ and $u$ such that (19) holds, i.e.,
such that

$$V(U(z)) = uV(z). \tag{21}$$

Such a function $V$ is called the Schröder function of $U$, because it was introduced
by Ernst Schröder in Math. Annalen 3 (1871), 296-322. Let us now look at the
problem of finding the Schröder function $V(z) = z + V_2 z^2 + \cdots$ of a given power
series $U(z) = U_1 z + U_2 z^2 + \cdots$. Clearly $u = U_1$ if (21) is to hold.

Expanding (21) with $u = U_1$ and equating coefficients of $z$ leads to a
sequence of equations that begins

$$U_1^2 V_2 + U_2 = U_1 V_2$$
$$U_1^3 V_3 + 2U_1 U_2 V_2 + U_3 = U_1 V_3$$
$$U_1^4 V_4 + 3U_1^2 U_2 V_3 + 2U_1 U_3 V_2 + U_2^2 V_2 + U_4 = U_1 V_4$$

and so on. Clearly there is no solution when $U_1 = 0$ (unless trivially $U_2 = U_3 =
\cdots = 0$); otherwise there is a unique solution unless $U_1$ is a root of unity. We
might have expected that something funny would happen when $U_1^n = 1$, since
Eq. (20) tells us that $U^{[n]}(z) = z$ if the Schröder function exists. For the moment
let us assume that $U_1$ is nonzero and not a root of unity; then the Schröder
function does exist, and the next question is how to compute it without doing
too much work.
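
Before turning to faster methods, the coefficient equations above can be solved one at a time. The following Python sketch does exactly that, in the slow straightforward way; it is an illustration only, the helper names `mul`, `compose`, and `schroeder` are hypothetical, and exact rational arithmetic is assumed. Equating coefficients of $z^k$ in (21) gives $V_k = c_k/(U_1 - U_1^k)$, where $c_k$ is the coefficient of $z^k$ in $\sum_{j<k} V_j U(z)^j$.

```python
from fractions import Fraction


def mul(a, b, n):
    """Product of two coefficient lists, truncated to n terms (mod z^n)."""
    c = [Fraction(0)] * n
    for i, ai in enumerate(a[:n]):
        if ai:
            for j, bj in enumerate(b[:n - i]):
                c[i + j] += ai * bj
    return c


def compose(p, w, n):
    """p(w(z)) mod z^n by Horner's rule; requires w[0] == 0."""
    result = [Fraction(0)] * n
    for coeff in reversed(p[:n]):
        result = mul(result, w, n)
        result[0] += coeff
    return result


def schroeder(U, n):
    """Coefficients V_0..V_{n-1} with V(U(z)) = U_1 V(z) + O(z^n), V_1 = 1.

    Assumes U[0] == 0 and that U[1] is nonzero and not a root of unity.
    """
    U = [Fraction(x) for x in U]
    u = U[1]
    V = [Fraction(0), Fraction(1)]
    for k in range(2, n):
        c_k = compose(V, U, k + 1)[k]     # [z^k] of sum_{j<k} V_j U(z)^j
        V.append(c_k / (u - u ** k))
    return V
```

As a check, $U(z) = 2z + z^2 = (1+z)^2 - 1$ has Schröder function $\ln(1+z)$ with $u = 2$, and `schroeder([0, 2, 1], 6)` returns the coefficients $0, 1, -\frac12, \frac13, -\frac14, \frac15$.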

The following procedure has been suggested by R. P. Brent and J. F. Traub.
Equation (21) leads to subproblems of a similar but more complicated form, so
we set ourselves a more general task whose subtasks have the same form: Let us
try to find $V(z) = V_0 + V_1 z + \cdots + V_{n-1} z^{n-1}$ such that

$$V(U(z)) = W(z)V(z) + S(z) + O(z^n), \tag{22}$$

given $U(z)$, $W(z)$, $S(z)$, and $n$, where $n$ is a power of 2 and $U(0) = 0$. If $n = 1$
we simply let $V_0 = S(0)/(1 - W(0))$, with $V_0 = 1$ if $S(0) = 0$ and $W(0) = 1$.
Furthermore it is possible to go from $n$ to $2n$: First we find $R(z)$ such that

$$V(U(z)) = W(z)V(z) + S(z) - z^n R(z) + O(z^{2n}). \tag{23}$$

Then we compute

$$\hat W(z) = W(z)(z/U(z))^n, \qquad \hat S(z) = R(z)(z/U(z))^n \tag{24}$$

and find $\hat V(z) = V_n + V_{n+1} z + \cdots + V_{2n-1} z^{n-1}$ such that

$$\hat V(U(z)) = \hat W(z)\hat V(z) + \hat S(z) + O(z^n). \tag{25}$$

It follows that $V^*(U(z)) = W(z)V^*(z) + S(z) + O(z^{2n})$ as desired, where $V^*(z)$
is equal to $V(z) + z^n \hat V(z)$.

The running time $T(n)$ of this procedure satisfies

$$T(2n) = 2T(n) + C(n), \tag{26}$$

where $C(n)$ is the time to compute $R(z)$, $\hat W(z)$, and $\hat S(z)$. The function $C(n)$ is
dominated by the time to compute $V(U(z))$ modulo $z^{2n}$, and $C(n)$ presumably
grows faster than order $n^{1+\epsilon}$; therefore the solution $T(n)$ to (26) will be of
order $C(n)$. For example, if $C(n) = cn^3$ we have $T(n) \approx \frac16 cn^3$; or if $C(n)$ is
$O(n \log n)^{3/2}$ using "fast" composition, we have $T(n) = O(n \log n)^{3/2}$.

The procedure breaks down when $W(0) = 1$ and $S(0) \ne 0$, so we need to
investigate when this can happen. It is easy to prove by induction on $n$ that the
solution of (22) by the Brent-Traub method entails consideration of exactly $n$
subproblems, in which the coefficient of $V(z)$ on the right-hand side takes the
respective values $W(z)(z/U(z))^j$ for $0 \le j < n$ in some order. If $W(0) = U_1$
and if $U_1$ is not a root of unity, we therefore have $W(0) = 1$ only when $j = 1$;
the procedure will fail in this case only if (22) has no solution for $n = 2$.

The Schröder function for $U$ can therefore be found by solving (22) for $n =
2, 4, 8, 16, \ldots$, with $W(z) = U_1$ and $S(z) = 0$, whenever $U_1$ is nonzero and not
a root of unity.

If $U_1 = 1$, there is no Schröder function unless $U(z) = z$. But Brent and
Traub have found a fast way to compute $U^{[n]}(z)$ even when $U_1 = 1$, by making
use of a function $V(z)$ such that

$$V(U(z)) = U'(z)V(z). \tag{27}$$

If two functions $U(z)$ and $\hat U(z)$ both satisfy (27), for the same $V$, it is easy to
check that their composition $U(\hat U(z))$ does too; therefore all iterates of $U(z)$ are
solutions of (27). Suppose that $U(z) = z + U_k z^k + U_{k+1} z^{k+1} + \cdots$ where
$k \ge 2$ and $U_k \ne 0$. Then it can be shown that there is a unique power series of
the form $V(z) = z^k + V_{k+1} z^{k+1} + V_{k+2} z^{k+2} + \cdots$ satisfying (27). Conversely
if such a function $V(z)$ is given, and if $k \ge 2$ and $U_k$ are given, then there is a
unique power series of the form $U(z) = z + U_k z^k + U_{k+1} z^{k+1} + \cdots$ satisfying
(27); the desired iterate $U^{[n]}(z)$ is the unique power series $P(z)$ satisfying

$$V(P(z)) = P'(z)V(z) \tag{28}$$

such that $P(z) = z + nU_k z^k + \cdots$. Both $V(z)$ and $P(z)$ can be found by
appropriate algorithms (see exercise 14).

If $U_1$ is a $k$th root of unity, but not equal to 1, the same method can be
applied to the function $U^{[k]}(z) = z + \cdots$, and $U^{[k]}(z)$ can be found from $U(z)$
by doing $l(k)$ composition operations (cf. Section 4.6.3). We can also handle the
case $U_1 = 0$: If $U(z) = U_k z^k + U_{k+1} z^{k+1} + \cdots$ where $k \ge 2$ and $U_k \ne 0$,
the idea is to find a solution to the equation $V(U(z)) = U_k V(z)^k$; then
$U^{[n]}(z) = V^{[-1]}\bigl(U_k^{(k^n - 1)/(k - 1)} V(z)^{k^n}\bigr)$. Finally, if $U(z) = U_0 + U_1 z + \cdots$ where $U_0 \ne 0$,
let $\alpha$ be a "fixed point" such that $U(\alpha) = \alpha$, and let $\hat U(z) = U(\alpha + z) - \alpha =
zU'(\alpha) + z^2 U''(\alpha)/2! + \cdots$; then $U^{[n]}(z) = \hat U^{[n]}(z - \alpha) + \alpha$. Further details can
be found in Brent and Traub's paper [SIAM J. Computing 9 (1980), 54-66].

Algebraic functions. The coefficients of each power series $W(z)$ that satisfies a
general equation of the form

$$A_n(z)W(z)^n + \cdots + A_1(z)W(z) + A_0(z) = 0, \tag{29}$$

where each $A_i(z)$ is a polynomial, can be computed efficiently by using methods
due to H. T. Kung and J. F. Traub; see JACM 25 (1978), 245-260.


EXERCISES 

1. [M10] The text explains how to divide $U(z)$ by $V(z)$ when $V_0 \ne 0$; how should
the division be done when $V_0 = 0$?

2. [20] If the coefficients of $U(z)$ and $V(z)$ are integers and $V_0 \ne 0$, find a recurrence
relation for the integers $V_0^{n+1} W_n$, where $W_n$ is defined by (3). How would you use this
for power series division?

3. [M15] Does formula (9) give the right results when $\alpha = 0$? When $\alpha = 1$?

► 4. [HM23] Show that simple modifications of (9) can be used to calculate $e^{V(z)}$ and
$\ln(1 + V(z))$, when $V(z) = V_1 z + V_2 z^2 + \cdots$.

5. [M00] What happens when a power series is reverted twice; i.e., if the output of
Algorithm L or T is reverted again?

► 6. [M21] (H. T. Kung.) Apply Newton's method to the computation of $W(z) =
1/V(z)$, when $V(0) \ne 0$, by finding the power series root of the equation $f(x) = 0$,
where $f(x) = x^{-1} - V(z)$.

7. [M23] Use Lagrange's inversion formula (12) to find a simple expression for the
coefficient $W_n$ in the reversion of $z = t - t^m$.

► 8. [M25] Lagrange's inversion formula can be generalized as follows: If $W(z) =
W_1 z + W_2 z^2 + \cdots = G_1 t + G_2 t^2 + G_3 t^3 + \cdots = G(t)$, where $z$ and $t$ are related by
Eq. (10), then $W_n = U_{n-1}/n$ where

$$U_0 + U_1 z + U_2 z^2 + \cdots = (G_1 + 2G_2 z + 3G_3 z^2 + \cdots)(1 + V_2 z + V_3 z^2 + \cdots)^{-n}.$$

(Equation (12) is the special case $G_1 = 1$, $G_2 = G_3 = \cdots = 0$. This equation can
be proved, for example, by using tree-enumeration formulas as in exercise 2.3.4.4-33.)
Extend Algorithm L so that it obtains the coefficients $W_1, W_2, \ldots$ in this more general
situation, without substantially increasing its running time.

9. [11] Find the values of $T_{mn}$ computed by Algorithm T as it determines the first
five coefficients in the reversion of $z = t - t^2$.

10. [M20] Given that $y = x^\alpha + a_1 x^{\alpha+1} + a_2 x^{\alpha+2} + \cdots$, $\alpha \ne 0$, show how to
compute the coefficients in the expansion $x = y^{1/\alpha} + b_2 y^{2/\alpha} + b_3 y^{3/\alpha} + \cdots$.

► 11. [M25] (Composition of power series.) Let

$$U(z) = U_0 + U_1 z + U_2 z^2 + \cdots \quad\text{and}\quad V(z) = V_1 z + V_2 z^2 + V_3 z^3 + \cdots;$$

design an algorithm that computes the first $N$ coefficients of $U(V(z))$.

12. [M20] Find a connection between polynomial division and power series division:
Given polynomials $u(x)$ and $v(x)$ of respective degrees $m$ and $n$ over a field, show how
to find the polynomials $q(x)$, $r(x)$ such that $u(x) = q(x)v(x) + r(x)$ and $\deg(r) < n$,
using only operations on power series.

13. [M27] (Rational function approximation.) It is occasionally desirable to find
polynomials whose quotient has the same initial terms as a given power series. For
example, if $W(z) = 1 + z + 3z^2 + 7z^3 + \cdots$, there are essentially four different ways
to express $W(z)$ as $w_1(z)/w_2(z) + O(z^4)$ where $w_1(z)$ and $w_2(z)$ are polynomials with
$\deg(w_1) + \deg(w_2) < 4$:

$$(1 + z + 3z^2 + 7z^3)/1 = 1 + z + 3z^2 + 7z^3 + 0z^4 + \cdots,$$
$$(3 - 4z + 2z^2)/(3 - 7z) = 1 + z + 3z^2 + 7z^3 + \tfrac{49}{3}z^4 + \cdots,$$
$$(1 - z)/(1 - 2z - z^2) = 1 + z + 3z^2 + 7z^3 + 17z^4 + \cdots,$$
$$1/(1 - z - 2z^2 - 2z^3) = 1 + z + 3z^2 + 7z^3 + 15z^4 + \cdots.$$

Rational functions of this kind are commonly called Padé approximations, since they
were studied extensively by H. E. Padé [Annales Scient. de l'École Normale Supérieure
(3) 9 (1892), S1-S93].

Show that all Padé approximations $W(z) = w_1(z)/w_2(z) + O(z^N)$ with $\deg(w_1) +
\deg(w_2) < N$ can be obtained by applying an extended Euclidean algorithm to the
polynomials $z^N$ and $W_0 + W_1 z + \cdots + W_{N-1} z^{N-1}$; and design an all-integer algorithm
for the case that each $W_i$ is an integer. [Hint: See exercise 4.6.1-26.]

► 14. [HM30] Fill in the details of Brent and Traub's method for calculating $U^{[n]}(z)$
when $U(z) = z + U_k z^k + \cdots$, using (27) and (28).


And it shall be, 

when thou hast made an end of reading this book,
that thou shalt bind a stone to it, 
and cast it into the midst of Euphrates. 


—Jeremiah 51:63 


ANSWERS TO EXERCISES 


This branch of mathematics is the only one, I believe,
in which good writers frequently get results entirely erroneous. 

... It may be doubted if there is a single 
extensive treatise on probabilities in existence 
which does not contain solutions absolutely indefensible. 

—C. S. PEIRCE, in Popular Science Monthly (1878) 


NOTES ON THE EXERCISES 

1. An average problem for a mathematically inclined reader. 

3. See W. J. LeVeque, Topics in Number Theory 2 (Reading, Mass.: Addison-Wesley, 
1956), Chapter 3. [Note: One of the men who read a preliminary draft of the manuscript 
of this book reported that he had discovered a truly remarkable proof, which the margin 
of his copy was too small to contain.] 


SECTION 3.1 

1. (a) This will usually fail, since "round" telephone numbers are often selected by
the telephone user when possible. In some communities, telephone numbers are perhaps 
assigned randomly. But it would be a mistake in any case to try to get several successive 
random numbers from the same page, since the same telephone number is often listed 
several times in a row. 

(b) But do you use the left-hand page or the right-hand page? Say, use the left- 
hand page number, divide by 2, and take the units digit. The total number of pages 
should be a multiple of 20; but even so, this method will have some bias. 

(c) The markings on the faces will slightly bias the die, but for practical purposes
this method is quite satisfactory (and it has been used by the author in the preparation 
of several examples in this set of books). See Math. Comp. 15 (1961), 94-95, for further 
discussion of these dice. 

(d) (This is a hard question thrown in purposely as a surprise.) The number is
not quite uniformly random. If the average number of emissions per minute is $m$, the
probability that the counter registers $k$ is $e^{-m} m^k/k!$ (the Poisson distribution); so the
digit 0 is selected with probability $e^{-m} \sum_{k \ge 0} m^{10k}/(10k)!$, etc. The units digit will be
even with probability $e^{-m} \cosh m = \frac12 + \frac12 e^{-2m}$, and this is never equal to $\frac12$ (although
the error is negligibly small when $m$ is large).
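
The bias described in (d) can be checked numerically. The following Python sketch sums the Poisson probabilities directly; the function names are hypothetical, and the series is simply truncated after `n_terms` terms, which is an assumption that is harmless for moderate $m$.

```python
import math


def poisson_pmf(m, n_terms):
    """Return [P(count = k) for k in 0..n_terms-1] for a Poisson mean m."""
    p, out = math.exp(-m), []
    for k in range(n_terms):
        out.append(p)
        p *= m / (k + 1)          # P(k+1) = P(k) * m / (k+1)
    return out


def digit_probabilities(m, n_terms=400):
    """Probability that the units digit of the count equals d, for d = 0..9."""
    pmf = poisson_pmf(m, n_terms)
    return [sum(pmf[d::10]) for d in range(10)]


def prob_even(m, n_terms=400):
    """Probability that the count (hence its units digit) is even."""
    return sum(poisson_pmf(m, n_terms)[0::2])
```

For $m = 2$ the value of `prob_even(2.0)` agrees with $\frac12 + \frac12 e^{-4}$ to within rounding error, so the excess over $\frac12$ is about 0.009.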

(e) Okay, provided that the time since the previous digit selected in this way is 
random. However, there is possible bias in borderline cases. 

(f,g) No, people usually think of certain digits (like 7) with higher probability. 

(h) Okay; your assignment of numbers to the horses had probability $\frac{1}{10}$ of assigning
a given digit to the winning horse.

2. The number of such sequences is the multinomial coefficient $1000000!/(100000!)^{10}$;
the probability is this number divided by $10^{1000000}$, the total number of sequences of
a million digits. By Stirling's approximation we find that the probability is close to
$1/(16\pi^4 10^{22} \sqrt{2\pi}) \approx 2.55 \times 10^{-26}$, about one chance in $4 \times 10^{25}$.
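
The size of this probability can be confirmed without forming the huge factorials, by working with logarithms. The sketch below is an illustration only; the function names are hypothetical and it relies on `math.lgamma`, where `lgamma(n + 1)` equals $\ln n!$.

```python
import math


def log10_equal_digit_probability(n=1_000_000, base=10):
    """log10 of the probability that each digit occurs exactly n/base times."""
    per_digit = n // base
    log_count = math.lgamma(n + 1) - base * math.lgamma(per_digit + 1)
    return (log_count - n * math.log(base)) / math.log(10)


def stirling_estimate():
    """The closed form 1 / (16 pi^4 10^22 sqrt(2 pi)) quoted in the answer."""
    return 1.0 / (16 * math.pi ** 4 * 1e22 * math.sqrt(2 * math.pi))
```

Evaluating `10 ** log10_equal_digit_probability()` gives a value that matches `stirling_estimate()` to several significant digits, in line with the "one chance in $4 \times 10^{25}$" figure.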

3. 3040504030. 

4. Step K11 can be entered only from step K10 or step K2, and in either case we find
it impossible for X to be zero by a simple argument. If X could be zero at that point, 
the algorithm would not terminate. 

5. Since only $10^{10}$ ten-digit numbers are possible, some value of $X$ must be repeated
during the first $10^{10} + 1$ steps; and as soon as a value is repeated, the sequence continues
to repeat its past behavior.

6. (a) Arguing as in the previous exercise, the sequence must eventually repeat a
value; let this repetition occur for the first time at step $\mu + \lambda$, where $X_{\mu+\lambda} = X_\mu$.
(This condition defines $\mu$ and $\lambda$.) We have $0 \le \mu < m$, $0 < \lambda \le m$, $\mu + \lambda \le m$. The
values $\mu = 0$, $\lambda = m$ are attained iff $f$ is a cyclic permutation; and $\mu = m - 1$, $\lambda = 1$
occurs, e.g., if $X_0 = 0$, $f(x) = x + 1$ for $x < m - 1$, and $f(m - 1) = m - 1$.

(b) We have, for $r > n$, $X_r = X_n$ iff $r - n$ is a multiple of $\lambda$ and $n \ge \mu$.
Hence $X_{2n} = X_n$ iff $n$ is a multiple of $\lambda$ and $n \ge \mu$. The desired results now follow
immediately. [Note: This is essentially a proof of the familiar mathematical result that
the powers of an element in a finite semigroup include a unique idempotent element:
take $X_0 = a$, $f(x) = ax$.]

(c) Once $n$ has been found, generate $X_i$ and $X_{n+i}$ for $i \ge 0$ until first finding
$X_i = X_{n+i}$; then $\mu = i$. If none of the values of $X_{n+i}$ for $0 < i < \mu$ is equal to $X_n$,
it follows that $\lambda = n$, otherwise $\lambda$ is the smallest such $i$.
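
Parts (b) and (c) together give a complete procedure for finding $\mu$ and $\lambda$ with a constant amount of memory. A minimal Python sketch follows; the function name `floyd_mu_lambda` is hypothetical, and `f` is assumed to map a finite set into itself.

```python
def floyd_mu_lambda(f, x0):
    """Return (mu, lam) for the sequence X_0 = x0, X_{n+1} = f(X_n)."""
    # Part (b): find the least n > 0 with X_{2n} = X_n.
    x, y, n = f(x0), f(f(x0)), 1
    while x != y:
        x, y, n = f(x), f(f(y)), n + 1
    x_n = x
    # Part (c): advance X_i and X_{n+i} together until they meet.
    a, b, i, lam = x0, x_n, 0, None
    while a != b:
        a, b, i = f(a), f(b), i + 1
        if lam is None and a != b and b == x_n:
            lam = i               # smallest i in 0 < i < mu with X_{n+i} = X_n
    mu = i
    if lam is None:
        lam = n                   # no earlier match, so the period is n itself
    return mu, lam
```

For example, with $f(x) = x + 1$ for $x < 9$ and $f(9) = 4$, starting from 0, the call returns $\mu = 4$ and $\lambda = 6$.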

7. (a) The least $n > 0$ such that $n - (\ell(n) - 1)$ is a multiple of $\lambda$ and $\ell(n) - 1 \ge \mu$
is $n = 2^{\lceil \lg \max(\mu+1, \lambda) \rceil} - 1 + \lambda$. [This may be compared with the least $n > 0$ such
that $X_{2n} = X_n$, namely $\lambda\delta_{\mu 0} + \mu + \lambda - 1 - ((\mu + \lambda - 1) \bmod \lambda)$.]

(b) Start with $X = Y = X_0$, $k = m = 1$. (At key places in this algorithm we
will have $X = X_{2m-k-1}$, $Y = X_{m-1}$, and $m = \ell(2m - k)$.) To generate the next
random number, do the following steps: Set $X \leftarrow f(X)$ and $k \leftarrow k - 1$. If $X = Y$,
stop (the period length $\lambda$ is equal to $m - k$). Otherwise if $k = 0$, set $Y \leftarrow X$, $m \leftarrow 2m$,
$k \leftarrow m$. Output $X$.
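
The procedure in part (b) translates almost line for line into Python. This is a sketch, not a definitive implementation: the name `brent_generate` is hypothetical, and the values that would be "output" are simply collected in a list.

```python
def brent_generate(f, x0):
    """Run the generator of part (b); return (period length, values output)."""
    x = y = x0
    k = m = 1
    outputs = []
    while True:
        x = f(x)
        k -= 1
        if x == y:
            return m - k, outputs     # period length lambda = m - k
        if k == 0:
            y, m = x, 2 * m           # remember X at the next power of 2
            k = m
        outputs.append(x)
```

With $f(x) = (x + 1) \bmod 5$ and $X_0 = 0$ the returned period is 5; with $f(x) = x + 1$ for $x < 9$ and $f(9) = 4$ it is 6, found after 13 evaluations of $f$, in agreement with the formula in part (a) for $\mu = 4$, $\lambda = 6$.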

Notes: Brent has also considered a more general method in which the successive
values of $Y = X_{n_i}$ satisfy $n_1 = 0$, $n_{i+1} = 1 + \lfloor p n_i \rfloor$ where $p$ is any number greater
than 1. He showed that the best choice of $p$, approximately 2.4771, saves about 3
percent of the iterations by comparison with $p = 2$. (See exercise 4.5.4-4.)

The method in part (b) has a serious deficiency, however, since it might generate
a lot of nonrandom numbers before shutting off. For example, we might have a
particularly bad case such as $\lambda = 1$, $\mu = 2^k$. A method based on Floyd's idea in
exercise 6(b), namely one that maintains $Y = X_{2n}$ and $X = X_n$ for $n = 0, 1, 2,
\ldots$, will require a few more function evaluations than Brent's method, but it will stop
before any number has been output twice.

On the other hand if $f$ is unknown (e.g., if we are receiving the values $X_0, X_1, \ldots$
on-line from an outside source) or if $f$ is difficult to apply, the following cycle detection
algorithm due to R. W. Gosper will be preferable: Maintain an auxiliary table $T_0, T_1,
\ldots, T_m$, where $m = \lfloor \lg n \rfloor$ when receiving $X_n$. Initially $T_0 \leftarrow X_0$. For $n = 1, 2,
\ldots$, compare $X_n$ with each of $T_0, \ldots, T_{\lfloor \lg n \rfloor}$; if no match is found, set $T_{e(n)} \leftarrow X_n$,
where $e(n) = \max\{e \mid 2^e \text{ divides } n + 1\}$. But if a match $X_n = T_k$ is found, then
$\lambda = n - \max\{m \mid m < n \text{ and } e(m) = k\}$. This procedure does not stop before
generating a number twice, but at most $\lceil \frac23 \lambda \rceil$ of the $X$'s will have occurred more than
once. [MIT AI Laboratory Memo 239 (Feb. 29, 1972), Hack 132.]

See also R. Sedgewick and T. G. Szymanski, Proc. ACM Symp. Th. Comp. 11 
(1979), 74-80. 

8. (a,b) 00,00,... [62 starting values]; 10,10,... [19]; 60,60,... [15]; 50,50,... [1];
24,57,24,57,... [3]. (c) 42 or 69; these both lead to a set of fifteen distinct values, 
namely (42 or 69), 76, 77, 92, 46, 11, 12, 14, 19, 36, 29, 84, 05, 02, 00. 

9. Since $X < b^n$, we have $X^2 < b^{2n}$, and the middle square is $\lfloor X^2/b^n \rfloor \le X^2/b^n$.
If $X > 0$, then $X^2/b^n < Xb^n/b^n = X$.

10. If $X = ab^n$, the next number of the sequence has the same form; it is $(a^2 \bmod b^n)b^n$.
If $a$ is a multiple of $b$ or of all the prime factors of $b$, the sequence will soon degenerate
to zero; if not, the sequence will degenerate into a cycle of numbers having the same
general form as $X$.

Further facts about the middle-square method have been found by B. Jansson,
Random Number Generators (Stockholm: Almqvist & Wiksell, 1966), Section 3A.
Numerologists will be interested to learn that the number 3792 is self-reproducing in the
four-digit middle-square method, since $3792^2 = 14379264$; furthermore (as Jansson has
observed), it is "self-reproducing" in another sense, too, since its prime factorization is
$3 \cdot 79 \cdot 2^4$!
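
The numerical claims about the middle-square method in answers 8 and 10 are easy to reproduce. The following Python sketch assumes the usual convention that the square of a `digits`-digit number is padded to `2*digits` digits and its middle `digits` digits are kept; the function names are hypothetical.

```python
def middle_square(x, digits):
    """Middle `digits` digits of x*x, viewed as a 2*digits-digit number."""
    return (x * x // 10 ** (digits // 2)) % 10 ** digits


def orbit(x, digits):
    """Distinct values visited from x until the first repetition."""
    seen = []
    while x not in seen:
        seen.append(x)
        x = middle_square(x, digits)
    return seen
```

Here `middle_square(3792, 4)` returns 3792, and `orbit(42, 2)` lists the fifteen values 42, 76, 77, 92, 46, 11, 12, 14, 19, 36, 29, 84, 05, 02, 00 quoted in answer 8(c).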

11. The probability that $\lambda = 1$ and $\mu = 0$ is the probability that $X_1 = X_0$, namely
$1/m$. The probability that $\lambda = 1$, $\mu = 1$, or $\lambda = 2$, $\mu = 0$, is the probability that
$X_1 \ne X_0$ and that $X_2$ has a certain value, so it is $(1 - 1/m)(1/m)$. Similarly, the
probability that the sequence has any given $\mu$ and $\lambda$ is a function only of $\mu + \lambda$, namely

$$P(\mu, \lambda) = \frac{1}{m} \prod_{1 \le k < \mu+\lambda} \Bigl(1 - \frac{k}{m}\Bigr).$$

For the probability that $\lambda = 1$, we have

$$\sum_{\mu \ge 0} \frac{1}{m} \prod_{1 \le k \le \mu} \Bigl(1 - \frac{k}{m}\Bigr) = \frac{Q(m)}{m},$$

where $Q(m)$ is defined in Section 1.2.11.3, Eq. (2). By Eq. (25) in that section, the
probability is approximately $\sqrt{\pi/2m} \approx 1.25/\sqrt{m}$. The chance of Algorithm K
converging as it did is only about one in 80000; the author was decidedly unlucky.
But see exercise 15 for further comments on the "colossalness."


12. $$\sum_{\substack{1 \le \lambda \le m \\ 0 \le \mu < m}} \lambda P(\mu, \lambda) = \frac{1}{m}\Bigl(1 + 3\Bigl(1 - \frac{1}{m}\Bigr) + 6\Bigl(1 - \frac{1}{m}\Bigr)\Bigl(1 - \frac{2}{m}\Bigr) + \cdots\Bigr) = \frac{1 + Q(m)}{2}.$$

(See the previous answer. In general if $f(a_0, a_1, \ldots) = \sum_{n \ge 0} a_n \prod_{1 \le k \le n} (1 - k/m)$
then $f(a_0, a_1, \ldots) = a_0 + f(a_1, a_2, \ldots) - f(a_1, 2a_2, \ldots)/m$; apply this identity with
$a_n = (n + 1)/2$.) Therefore the average value of $\lambda$ (and, by symmetry of $P(\mu, \lambda)$,
also of $\mu + 1$) is approximately $\sqrt{\pi m/8} + \frac13$. The average value of $\mu + \lambda$ is exactly
$Q(m)$, approximately $\sqrt{\pi m/2} - \frac13$. [For alternative derivations and further results,
including asymptotic values for the moments, see A. Rapoport, Bull. Math. Biophysics
10 (1948), 145-157, and B. Harris, Annals Math. Stat. 31 (1960), 1045-1062; see also
I. M. Sobol', Theory of Probability and its Applications 9 (1964), 333-338. Sobol'
discusses the asymptotic period length for the more general sequence $X_{n+1} = f(X_n)$ if
$n \not\equiv 0$ (modulo $m$); $X_{n+1} = g(X_n)$ if $n \equiv 0$ (modulo $m$); with both $f$ and $g$ random.]
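
For very small $m$ these averages can be confirmed by exhaustive enumeration of all $m^m$ functions. The Python sketch below is an illustration under the assumption that $Q(m) = 1 + \frac{m-1}{m} + \frac{(m-1)(m-2)}{m^2} + \cdots$, as in the section cited in answer 11; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def Q(m):
    """Q(m) = 1 + (m-1)/m + (m-1)(m-2)/m^2 + ..., as an exact fraction."""
    total, term = Fraction(0), Fraction(1)
    for k in range(1, m + 1):
        total += term                  # term = prod_{j<k} (1 - j/m)
        term *= Fraction(m - k, m)
    return total


def mu_lambda(f, x0):
    """Return (mu, lam) for X_{n+1} = f[X_n], where f is a tuple of images."""
    first_seen, x, n = {}, x0, 0
    while x not in first_seen:
        first_seen[x] = n
        x, n = f[x], n + 1
    return first_seen[x], n - first_seen[x]


def exact_averages(m):
    """Average mu and lam over all functions on {0..m-1} and all X_0."""
    total_mu = total_lam = count = 0
    for f in product(range(m), repeat=m):
        for x0 in range(m):
            mu, lam = mu_lambda(f, x0)
            total_mu, total_lam, count = total_mu + mu, total_lam + lam, count + 1
    return Fraction(total_mu, count), Fraction(total_lam, count)
```

For $m = 4$ the enumeration gives an average $\lambda$ of $103/64 = (1 + Q(4))/2$ and an average $\mu + \lambda$ of $71/32 = Q(4)$.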

13. [Paul Purdom and John Williams, Trans. Amer. Math. Soc. 133 (1968), 547-551.]
Let $T_{mn}$ be the number of functions that have $n$ one-cycles and no cycles of length
greater than one. Then

$$T_{mn} = \binom{m-1}{n-1} m^{m-n}.$$

(See Section 2.3.4.4.) Any function is such a function followed by a permutation of the
$n$ elements that were the one-cycles. Hence $\sum_{n \ge 1} T_{mn}\, n! = m^m$.

Let $P_{nk}$ be the number of permutations of $n$ elements in which the longest cycle
is of length $k$. Then the number of functions with a maximum cycle of length $k$ is
$\sum_{n \ge 1} T_{mn} P_{nk}$. To get the average value of $k$, we compute $\sum_{k \ge 1} \sum_{n \ge 1} k T_{mn} P_{nk}$,
which by the result of exercise 1.3.3-23 is $\sum_{n \ge 1} T_{mn}\, n!\,(n + \frac12 + \epsilon_n)c$ where $c \approx .62433$
and $\epsilon_n \to 0$ as $n \to \infty$. Summing, we get the average value $cQ(m) + \frac12 c + \delta_m$, where
$\delta_m \to 0$ as $m \to \infty$. (This is not substantially larger than the average value when $X_0$
is selected at random. The average value of $\max(\mu + \lambda)$ is still unknown.)

14. Let $c_r(m)$ be the number of functions with exactly $r$ different final cycles. From
the recurrence $c_1(m) = (m - 1)! - \sum_{k > 0} \binom{m}{k} (-1)^k (m - k)^k c_1(m - k)$, which comes
by counting the number of functions whose image contains at most $m - k$ elements,
we find the solution $c_1(m) = m^{m-1} Q(m)$. (Cf. exercise 1.2.11.3-16.) Another way
to obtain the value of $c_1(m)$, which is perhaps more elegant and revealing, is given in
exercise 2.3.4.4-17. The value of $c_r(m)$ may be determined by solving the recurrence

$$c_r(m) = \sum_{0 < k < m} \binom{m-1}{k-1} c_1(k)\, c_{r-1}(m - k),$$

which has the solution

$$c_r(m) = m^{m-1} \Bigl( \frac{1}{0!} {1 \brack r} + \frac{1}{1!} {2 \brack r} \frac{m-1}{m} + \frac{1}{2!} {3 \brack r} \frac{m-1}{m} \frac{m-2}{m} + \cdots \Bigr).$$
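
The desired average value can now be computed; it is (cf. exercise 12)

$$E_m = \frac{1}{m}\Bigl(H_1 + 2H_2 \frac{m-1}{m} + 3H_3 \frac{m-1}{m} \frac{m-2}{m} + \cdots\Bigr) = 1 + \frac12 \frac{m-1}{m} + \frac13 \frac{m-1}{m} \frac{m-2}{m} + \cdots.$$

This latter formula was obtained by quite different means by Martin D. Kruskal, AMM
61 (1954), 392-397. Using the integral representation

$$E_m = \int_0^\infty \Bigl(\Bigl(1 + \frac{x}{m}\Bigr)^m - 1\Bigr) e^{-x} \frac{dx}{x},$$

he proved the asymptotic relation $\lim_{m \to \infty} (E_m - \frac12 \ln m) = \frac12(\gamma + \ln 2)$. For further
results and references, see John Riordan, Annals Math. Stat. 33 (1962), 178-185.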


15. The probability that $f(x) \ne x$ for all $x$ is $(m - 1)^m/m^m$, which is approximately $1/e$.
The existence of a self-repeating value in an algorithm like Algorithm K is therefore
not "colossal" at all; it occurs with probability $1 - 1/e \approx .63212$. The only "colossal"
thing was that the author happened to hit such a value when $X_0$ was chosen at random
(see exercise 11).


16. The sequence will repeat when a pair of successive elements occurs for the second
time. The maximum period is $m^2$. (Cf. next exercise.)


17. After selecting $X_0, \ldots, X_{k-1}$ arbitrarily, let $X_{n+1} = f(X_n, \ldots, X_{n-k+1})$, where
$0 \le x_1, \ldots, x_k < m$ implies that $0 \le f(x_1, \ldots, x_k) < m$. The maximum period is $m^k$.
This is an obvious upper bound, but it is not obvious that it can be attained; for a
proof that it can always be attained for suitable $f$, see exercises 3.2.2-17 and 3.2.2-21,
and for the number of ways to attain it see exercise 2.3.4.2-23.


18. Same as exercise 7, but use the $k$-tuple of elements $(X_n, \ldots, X_{n-k+1})$ in place of
the single element $X_n$.


SECTION 3.2.1 

1. Take $X_0$ even, $a$ even, $c$ odd. Then $X_n$ is odd for $n > 0$.

2. Let $X_r$ be the first repeated value in the sequence. If $X_r$ were equal to $X_k$ for some
$k$ where $0 < k < r$, we could prove that $X_{r-1} = X_{k-1}$, since $X_n$ uniquely determines
$X_{n-1}$ when $a$ is prime to $m$. Hence $k = 0$.

3. If $d$ is the greatest common divisor of $a$ and $m$, the quantity $aX_n$ can take on at
most $m/d$ values. The situation can be even worse; e.g., if $m = 2^e$ and if $a$ is even,
Eq. (6) shows that the sequence is eventually constant.

4. Induction on k. 

5. If $a$ is relatively prime to $m$, there is a number $a'$ for which $aa' \equiv 1$ (modulo $m$).
Then $X_{n-1} = (a'X_n - a'c) \bmod m$, and in general,

$$X_{n-k} = \bigl((a')^k X_n - c(a' + \cdots + (a')^k)\bigr) \bmod m = \Bigl((a')^k X_n - c\,\frac{(a')^{k+1} - a'}{a' - 1}\Bigr) \bmod m$$

when $k \ge 0$, $n - k \ge 0$. If $a$ is not relatively prime to $m$, it is not possible to determine
$X_{n-1}$ when $X_n$ is given; multiples of $m/\gcd(a, m)$ may be added to $X_{n-1}$ without
changing the value of $X_n$. (See also exercise 3.2.1.3-7.)
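
Running a linear congruential sequence backward, as this answer describes, takes only a modular inverse. The following Python sketch is illustrative: the function names are hypothetical, and `pow(a, -1, m)` (available in Python 3.8 and later) is used to obtain $a'$; it raises `ValueError` when $a$ is not relatively prime to $m$.

```python
def lcg_next(x, a, c, m):
    """One forward step: X_{n+1} = (a X_n + c) mod m."""
    return (a * x + c) % m


def lcg_prev(x, a, c, m, k=1):
    """Step backward k times using X_{n-1} = (a' X_n - a' c) mod m."""
    a_inv = pow(a, -1, m)
    for _ in range(k):
        x = (a_inv * (x - c)) % m
    return x


def lcg_prev_closed_form(x, a, c, m, k):
    """X_{n-k} = ((a')^k X_n - c (a' + ... + (a')^k)) mod m."""
    a_inv = pow(a, -1, m)
    geometric = sum(pow(a_inv, j, m) for j in range(1, k + 1))
    return (pow(a_inv, k, m) * x - c * geometric) % m
```

Stepping forward seven times from any $X_0$ and then calling `lcg_prev(..., k=7)` returns $X_0$, and the closed form gives the same value.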


SECTION 3.2.1.1

1. Let $c'$ be a solution to the congruence $ac' \equiv c$ (modulo $m$). (Thus, $c' = a'c \bmod m$,
if $a'$ is the number in the answer to exercise 3.2.1-5.) Then we have

        LDA   X
        ADD   CPRIME
        MUL   A          |

Overflow is possible on this addition operation. (From results derived later in the
chapter, it is probably best to save a unit of time, taking $c = a$ and replacing the ADD
instruction by "INCA 1". Then if $X_0 = 0$, overflow will not occur until the end of the
period, so it won't occur in practice.)

2.
RANDM   STJ   1F
        LDA   XRAND
        MUL   A
        SLAX  5
        ADD   C          (or, INCA c, if c is small)
        STA   XRAND
1H      JNOV  *
        JMP   *-1
XRAND   CON   X0
A       CON   a
C       CON   c          |

Note: Locations A and C should probably be named 2H and 3H to avoid conflict with
other symbols, if this subroutine is to be used by other programmers.

3. See WM1 at the end of Program 4.2.3D. 

4. Define the operation $x \bmod 2^e = y$ iff $x \equiv y$ (modulo $2^e$) and $-2^{e-1} \le y < 2^{e-1}$.
The congruential sequence $\langle Y_n \rangle$ defined by

$$Y_0 = X_0 \bmod 2^{32}, \qquad Y_{n+1} = (aY_n + c) \bmod 2^{32}$$

is easy to compute on 370-style machines, since the lower half of the product of $y$
and $z$ is $(yz) \bmod 2^{32}$ for all two's complement numbers $y$ and $z$, and since addition
ignoring overflow also delivers its result mod $2^{32}$. This sequence has all the randomness
properties of the standard linear congruential sequence $\langle X_n \rangle$, since $Y_n \equiv X_n$
(modulo $2^{32}$). Indeed, the two's complement representation of $Y_n$ is identical to the
binary representation of $X_n$, for all $n$. [G. Marsaglia and T. A. Bray first pointed this
out in CACM 11 (1968), 757-759.]

5. (a) Subtraction: LDA X; SUB Y; JANN *+2; ADD M. (b) Addition: LDA X; SUB M;
ADD Y; JANN *+2; ADD M. (Note that if $m$ is more than half the word size, the
instruction "SUB M" must precede the instruction "ADD Y".)

6. The sequences are not essentially different, since adding the constant $(m - c)$ has
the same effect as subtracting the constant $c$. The operation must be combined with
multiplication, so a subtractive process has little merit over the additive one (at least
in MIX's case), except when it is necessary to avoid affecting the overflow toggle.

7. The prime factors of $z^k - 1$ appear in the factorization of $z^{kr} - 1$. If $r$ is odd,
the prime factors of $z^k + 1$ appear in the factorization of $z^{kr} + 1$. And $z^{2k} - 1$ equals
$(z^k - 1)(z^k + 1)$.

8.
        JOV   *+1        (Ensure that overflow is off.)
        LDA   X
        MUL   A
        STX   TEMP
        ADD   TEMP       Add lower half to upper half.
        JNOV  *+2        If ≥ w, subtract w − 1.
        INCA  1          (Overflow is impossible in this step.)  |

Note: Since addition on an $e$-bit ones'-complement computer is mod $(2^e - 1)$, it is
possible to combine the techniques of exercises 4 and 8, producing $(yz) \bmod (2^e - 1)$
by adding together the two $e$-bit halves of the product $yz$, for all ones' complement
numbers $y$ and $z$ regardless of sign.
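
The idea behind the program in answer 8 is that $qw + r \equiv q + r$ (modulo $w - 1$), so the two halves of the product can simply be added. A small Python sketch of the same arithmetic follows; the function name is hypothetical, and the word size is taken to be $w = 2^e$ purely for illustration.

```python
def mul_mod_w_minus_1(a, x, e):
    """Return a value congruent to a*x modulo w - 1, where w = 2**e.

    The result lies in 0..w-1, so w - 1 itself may appear in place of 0.
    """
    w = 1 << e
    q, r = divmod(a * x, w)     # a*x = q*w + r
    s = q + r                   # add upper half to lower half
    if s >= w:                  # corresponds to the overflow case
        s -= w - 1
    return s
```

As in the MIX program, the value $w - 1$ can occur as a representation of zero.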

9. The pairs $(0, w - 2)$, $(1, w - 1)$ are treated as equivalent in input and output:

        JOV   *+1
        LDA   X
        MUL   A          aX = qw + r.
        SLC   5          rA ← r, rX ← q.
        STX   TEMP
        ADD   TEMP
        JNOV  *+2        Get (r + q) mod (w − 2).
        INCA  2          Overflow is impossible.
        ADD   TEMP
        JNOV  *+3        Get (r + 2q) mod (w − 2).
        INCA  2          Overflow is possible.
        JOV




SECTION 3.2.1.2 

1. Period length m, by Theorem A. (Cf. exercise 3.) 

2. Yes, these conditions imply the conditions in Theorem A, since the only prime
divisor of $2^e$ is 2, and any odd number is relatively prime to $2^e$. (In fact, the conditions
of the exercise are necessary and sufficient.)

3. By Theorem A, we need $a \equiv 1$ (modulo 4) and $a \equiv 1$ (modulo 5). By Law D of
Section 1.2.4, this is equivalent to $a \equiv 1$ (modulo 20).
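
The condition $a \equiv 1$ (modulo 20) can be confirmed by brute force for a small decimal modulus. The Python sketch below tries every pair $(a, c)$ for $m = 100$; the function names are hypothetical, and the additional requirement that $c$ be relatively prime to $m$ is assumed to be the first condition of Theorem A, which is not restated here.

```python
from math import gcd


def has_full_period(a, c, m):
    """True iff X_{n+1} = (a X_n + c) mod m, started at 0, has period m."""
    x = 0
    for n in range(1, m + 1):
        x = (a * x + c) % m
        if x == 0:
            return n == m       # first return to 0 must take exactly m steps
    return False                # 0 never recurs, so it is not on the cycle


def full_period_multipliers(m):
    """Multipliers a in 0..m-1 that give period m for at least one c."""
    return sorted({a for a in range(m) for c in range(m)
                   if has_full_period(a, c, m)})
```

For $m = 100$ the multipliers found are 1, 21, 41, 61, 81, exactly the values with $a \equiv 1$ (modulo 20).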

4. We know $X_{2^{e-1}} \equiv 0$ (modulo $2^{e-1}$) by using Theorem A in the case $m = 2^{e-1}$.
Also using Theorem A for $m = 2^e$, we know that $X_{2^{e-1}} \not\equiv 0$ (modulo $2^e$). It follows
that $X_{2^{e-1}} = 2^{e-1}$. More generally, we can use Eq. 3.2.1-6 to prove that the second
half of the period is essentially like the first half, since $X_{n+2^{e-1}} = (X_n + 2^{e-1}) \bmod 2^e$.
(The quarters are similar too, see exercise 21.)

5. We need $a \equiv 1$ (modulo $p$) for $p = 3, 11, 43, 281, 86171$. By Law D of Section
1.2.4, this is equivalent to $a \equiv 1$ (modulo $3 \cdot 11 \cdot 43 \cdot 281 \cdot 86171$), so the only solution
is the terrible multiplier $a = 1$.

6. (Cf. previous exercise.) $a \equiv 1$ (modulo $3 \cdot 7 \cdot 11 \cdot 13 \cdot 37$) implies that the solutions
are $a = 1 + 111111k$, for $0 \le k \le 8$.

7. Using the notation of the proof of Lemma Q, $\mu$ is the smallest value such that
$X_{\mu+\lambda} = X_\mu$; so it is the smallest value such that $Y_{\mu+\lambda} = Y_\mu$ and $Z_{\mu+\lambda} = Z_\mu$.
This shows that $\mu = \max(\mu_1, \ldots, \mu_t)$. The highest achievable $\mu$ is $\max(e_1, \ldots, e_t)$, but
nobody really wants to achieve it.

8. We have $a^2 \equiv 1$ (modulo 8); so $a^4 \equiv 1$ (modulo 16), $a^8 \equiv 1$ (modulo 32), etc. If
$a \bmod 4 = 3$, then $a - 1$ is twice an odd number; so $(a^{2^{e-1}} - 1)/2 \equiv 0$ (modulo $2^{e+1}/2$),
and this yields the desired result.

9. Substitute for $X_n$ in terms of $Y_n$ and simplify. If $X_0 \bmod 4 = 3$, the formulas
of the exercise do not apply; but they do apply to the sequence $Z_n = (-X_n) \bmod 2^e$,
which has essentially the same behavior.

10. Only $m = 1$, 2, 4, $p^e$, and $2p^e$, for odd primes $p$. In all other cases, the result of
Theorem B is an improvement over Euler's theorem (exercise 1.2.4-28).

11. (a) Either $x + 1$ or $x - 1$ (not both) will be a multiple of 4, so $x \mp 1 = q2^f$, where $q$
is odd and $f$ is greater than 1. (b) In the given circumstances, $f < e$ and so $e \ge 3$. We
have $\pm x \equiv 1$ (modulo $2^f$) and $\pm x \not\equiv 1$ (modulo $2^{f+1}$) and $f > 1$. Hence by applying
Lemma P, we find that $(\pm x)^{2^{e-f-1}} \not\equiv 1$ (modulo $2^e$), while $x^{2^{e-f}} = (\pm x)^{2^{e-f}} \equiv 1$
(modulo $2^e$). So the order is a divisor of $2^{e-f}$, but not a divisor of $2^{e-f-1}$. (c) 1 has
order 1; $2^e - 1$ has order 2; the maximum period when $e \ge 3$ is therefore $2^{e-2}$, and
for $e \ge 4$ it is necessary to have $f = 2$, that is, $x \equiv 4 \pm 1$ (modulo 8).

12. If $k$ is a proper divisor of $p - 1$ and if $a^k \equiv 1$ (modulo $p$), then by Lemma P
we have $a^{kp^{e-1}} \equiv 1$ (modulo $p^e$). Similarly, if $a^{p-1} \equiv 1$ (modulo $p^2$), we find that
$a^{(p-1)p^{e-2}} \equiv 1$ (modulo $p^e$). So in these cases $a$ is not primitive. Conversely, if
$a^{p-1} \not\equiv 1$ (modulo $p^2$), Theorem 1.2.4F and Lemma P tell us that $a^{(p-1)p^{e-2}} \not\equiv 1$
(modulo $p^e$), but $a^{(p-1)p^{e-1}} \equiv 1$ (modulo $p^e$). So the order is a divisor of $(p - 1)p^{e-1}$
but not of $(p - 1)p^{e-2}$; it therefore has the form $kp^{e-1}$, where $k$ divides $p - 1$. But
if $a$ is primitive modulo $p$, the congruence $a^{kp^{e-1}} \equiv a^k \equiv 1$ (modulo $p$) implies that
$k = p - 1$.

13. Let $\lambda$ be the order of $a$ modulo $p$. By Theorem 1.2.4F, $\lambda$ is a divisor of $p - 1$. If
$\lambda < p - 1$, then $(p - 1)/\lambda$ has a prime factor, $q$.

14. Let $0 < k < p$. If $a^{p-1} \equiv 1$ (modulo $p^2$), then $(a + kp)^{p-1} \equiv a^{p-1} + (p - 1)a^{p-2}kp$
(modulo $p^2$); and this is $\not\equiv 1$, since $(p - 1)a^{p-2}k$ is not a multiple of $p$. By
exercise 12, $a + kp$ is primitive modulo $p^e$.

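15. (a) If $\lambda_1 = p_1^{e_1} \ldots p_t^{e_t}$, $\lambda_2 = p_1^{f_1} \ldots p_t^{f_t}$, let $\kappa_1 = p_1^{g_1} \ldots p_t^{g_t}$, $\kappa_2 = p_1^{h_1} \ldots p_t^{h_t}$, where

$$g_j = e_j \text{ and } h_j = 0, \quad \text{if } e_j < f_j,$$

$$g_j = 0 \text{ and } h_j = f_j, \quad \text{if } e_j \ge f_j.$$

Now $a_1^{\kappa_1}$, $a_2^{\kappa_2}$ have periods $\lambda_1/\kappa_1$, $\lambda_2/\kappa_2$, and the latter are relatively prime.
Furthermore $(\lambda_1/\kappa_1)(\lambda_2/\kappa_2) = \lambda$, so it suffices to consider the case when $\lambda_1$ is relatively prime
to $\lambda_2$, that is, when $\lambda = \lambda_1\lambda_2$. Now since $(a_1a_2)^\lambda \equiv 1$, we have $1 \equiv (a_1a_2)^{\lambda\lambda_1} \equiv a_2^{\lambda\lambda_1}$;
hence $\lambda\lambda_1$ is a multiple of $\lambda_2$. This implies that $\lambda$ is a multiple of $\lambda_2$, since $\lambda_1$ is
relatively prime to $\lambda_2$. Similarly, $\lambda$ is a multiple of $\lambda_1$; hence $\lambda$ is a multiple of $\lambda_1\lambda_2$.
But obviously $(a_1a_2)^{\lambda_1\lambda_2} \equiv 1$, so $\lambda = \lambda_1\lambda_2$.

(b) If $a_1$ has order $\lambda(m)$ and if $a_2$ has order $\lambda$, by part (a) $\lambda(m)$ must be a multiple
of $\lambda$, otherwise we could find an element of higher order, namely of order $\mathrm{lcm}(\lambda, \lambda(m))$.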
16. (a) $f(x) = (x - a)(x^{n-1} + (a + c_1)x^{n-2} + \cdots + (a^{n-1} + \cdots + c_{n-1})) + f(a)$.
(b) The statement is clear when $n = 0$. If $a$ is one root, $f(x) \equiv (x - a)q(x)$; therefore
if $a'$ is any other root,

$$0 \equiv f(a') \equiv (a' - a)q(a'),$$

and since $a' - a$ is not a multiple of $p$, $a'$ must be a root of $q(x)$. So if $f(x)$ has more
than $n$ distinct roots, $q(x)$ has more than $n - 1$ distinct roots. (c) $\lambda(p) \ge p - 1$, since
$f(x)$ must have degree $\ge p - 1$ in order to possess so many roots. But by Theorem
1.2.4F, $\lambda(p) \le p - 1$.

17. By Lemma P, $11^5 \equiv 1$ (modulo 25), $11^5 \not\equiv 1$ (modulo 125), etc.; so the order of
11 is $5^{e-1}$ (modulo $5^e$), not the maximum value $\lambda(5^e) = 4 \cdot 5^{e-1}$. But by Lemma Q
the total period length is the least common multiple of the period modulo $2^e$ (namely
$2^{e-2}$) and the period modulo $5^e$ (namely $5^{e-1}$), and this is $2^{e-2}5^{e-1} = \lambda(10^e)$. The
period modulo $5^e$ may be $5^{e-1}$ or $2 \cdot 5^{e-1}$ or $4 \cdot 5^{e-1}$, without affecting the length
of period modulo $10^e$, since the least common multiple is taken. The values that are
primitive modulo $5^e$ are those congruent to 2, 3, 8, 12, 13, 17, 22, 23 modulo 25 (cf.
exercise 12), namely 3, 13, 27, 37, 53, 67, 77, 83, 117, 123, 133, 147, 163, 173, 187, 197.

18. According to Theorem C, $a \bmod 8$ must be 3 or 5. Knowing the period of $a$
modulo 5 and modulo 25 allows us to apply Lemma P to determine admissible values
of $a \bmod 25$. Period $= 4 \cdot 5^{e-1}$: 2, 3, 8, 12, 13, 17, 22, 23; period $= 2 \cdot 5^{e-1}$: 4, 9,
14, 19; period $= 5^{e-1}$: 6, 11, 16, 21. Each of these 16 values yields one value of $a$,
$0 \le a < 200$, with $a \bmod 8 = 3$, and another value of $a$ with $a \bmod 8 = 5$.
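
The residue classes listed in answers 17 and 18 can be reproduced mechanically. The sketch below groups the residues modulo 25 by multiplicative order (orders 20, 10, and 5 correspond to periods $4 \cdot 5^{e-1}$, $2 \cdot 5^{e-1}$, and $5^{e-1}$ when $e = 2$) and then lists the values $0 \le a < 200$ that are primitive modulo 25 with $a \bmod 8 \in \{3, 5\}$. The function and variable names are illustrative only.

```python
def order_mod(x, m):
    """Multiplicative order of x modulo m (x must be relatively prime to m)."""
    n, y = 1, x % m
    while y != 1:
        y = y * x % m
        n += 1
    return n


def residues_by_order(modulus=25, prime=5):
    """Map each multiplicative order to the sorted residues having that order."""
    groups = {}
    for a in range(1, modulus):
        if a % prime:
            groups.setdefault(order_mod(a, modulus), []).append(a)
    return groups


groups = residues_by_order()
primitive_mod_25 = groups[20]
# Values of a below 200 with a mod 8 in {3, 5} that are primitive modulo 25.
multipliers = [a for a in range(200)
               if a % 8 in (3, 5) and a % 25 in primitive_mod_25]

if __name__ == "__main__":
    print(groups[20], groups[10], groups[5])
    print(multipliers)
```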

19. One example appears in Table 3.3.4-1, line 26. 

20. (a) We have $AY_n + X_0 \equiv AY_{n+k} + X_0$ (modulo $m$) iff $Y_n \equiv Y_{n+k}$ (modulo $m'$).
(b)(i) Obvious. (ii) Theorem A. (iii) $(a^n - 1)/(a - 1) \equiv 0$ (modulo $2^e$) iff $a^n \equiv 1$
(modulo $2^{e+1}$); if $a \not\equiv -1$, the order of $a$ modulo $2^{e+1}$ is twice its order modulo $2^e$.
(iv) $(a^n - 1)/(a - 1) \equiv 0$ (modulo $p^e$) iff $a^n \equiv 1$.

21. $X_{n+s} \equiv X_n + X_s$ by Eq. 3.2.1-6; and $s$ is a divisor of $m$, since $s$ is a power of $p$
when $m$ is a power of $p$. Hence a given integer $q$ is a multiple of $m/s$ iff $X_{qs} \equiv 0$ iff $q$
is a multiple of $m/\gcd(X_s, m)$.


SECTION 3.2.1.3 

1. $c = 1$ is always relatively prime to $B^5$; and every prime dividing $m = B^5$ is a
divisor of $B$, so it divides $b = B^2$ to at least the second power.

2. Only 3, so the generator is not recommended in spite of its long period. 

3. The potency is 18 in both cases (see next exercise). 

4. Since $a \bmod 4 = 1$, we must have $a \bmod 8 = 1$ or 5, so $b \bmod 8 = 0$ or 4. If $b$ is an
odd multiple of 4, and if $b_1$ is a multiple of 8, clearly $b^s \equiv 0$ (modulo $2^e$) implies that
$b_1^s \equiv 0$ (modulo $2^e$), so $b_1$ cannot have higher potency.

5. The potency is the smallest value of $s$ such that $f_js \ge e_j$ for all $j$.
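
As a small illustration of answers 3 to 5, the potency can also be found directly from its definition, as the least $s$ with $b^s \equiv 0$ (modulo $m$) where $b = a - 1$. The sketch below does this by repeated multiplication; the function name `potency` and the cutoff based on the bit length of $m$ are choices made here, not part of the text.

```python
def potency(a, m):
    """Least s with (a - 1)**s ≡ 0 (mod m), or None if no power of a - 1 vanishes."""
    b = (a - 1) % m
    power, s = b, 1
    while power != 0:
        # If some power of b vanishes mod m, the least such exponent is at most
        # the largest prime exponent in m, which is below m.bit_length().
        if s > m.bit_length():
            return None
        power = power * b % m
        s += 1
    return s


if __name__ == "__main__":
    # Multiplier quoted in the note to answer 8: b = 4 * (odd), so 2s >= 35 gives s = 18.
    print(potency(2**23 + 2**14 + 2**2 + 1, 2**35))
```

For a power-of-two modulus $2^e$ this agrees with the rule in answer 5: if $2^f$ is the highest power of 2 dividing $b$, the result is the least $s$ with $fs \ge e$.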

6. The modulus must be divisible by $2^7$ or by $p^4$ (for odd prime $p$) in order to have
a potency as high as 4. The only values are $m = 2^{27} + 1$ and $10^9 - 1$.

7. $a' = (1 - b + b^2 - \cdots) \bmod m$, where the terms in $b^s$, $b^{s+1}$, etc., are dropped (if
$s$ is the potency).
8. Since $X_n$ is always odd,

$$X_{n+2} = (2^{34} + 3 \cdot 2^{18} + 9)X_n \bmod 2^{35} = (2^{34} + 6X_{n+1} - 9X_n) \bmod 2^{35}.$$

Given $Y_n$ and $Y_{n+1}$, the possibilities for

$$Y_{n+2} \approx (5 + 6(Y_{n+1} + \epsilon_1) - 9(Y_n + \epsilon_2)) \bmod 10,$$

with $0 \le \epsilon_1 < 1$, $0 \le \epsilon_2 < 1$, are limited and nonrandom.

Note: If the multiplier suggested in exercise 3 were, say, $2^{33} + 2^{18} + 2^2 + 1$, instead
of $2^{23} + 2^{14} + 2^2 + 1$, we would similarly find $X_{n+2} - 10X_{n+1} + 25X_n \equiv$ constant
(modulo $2^{35}$). In general, we do not want $a \pm \delta$ to be divisible by high powers of 2
when $\delta$ is small, else we get “second order impotency.” See Section 3.3.4 for a more
detailed discussion.

The generator that appears in this exercise is discussed in an article by MacLaren 
and Marsaglia, JACM 12 (1965), 83-89. The deficiencies of such generators were first 
demonstrated by M. Greenberger, CACM 8 (1965), 177-179. Yet generators like this 
were still in widespread use more than ten years later (cf. the discussion of RANDU in 
Section 3.3.4). 
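
The three-term relation in answer 8 is easy to confirm numerically. The sketch below assumes the generator is $X_{n+1} = aX_n \bmod 2^{35}$ with $a = 2^{17} + 3$; that multiplier is inferred here from the displayed square $a^2 = 2^{34} + 3 \cdot 2^{18} + 9$ rather than quoted from the exercise, and the function names are illustrative.

```python
M = 2 ** 35
A = 2 ** 17 + 3  # assumed multiplier; A*A == 2**34 + 3*2**18 + 9


def sequence(x0, count):
    """First `count` terms of X_{n+1} = A * X_n mod M starting from x0."""
    xs = [x0 % M]
    for _ in range(count - 1):
        xs.append(A * xs[-1] % M)
    return xs


def satisfies_three_term_relation(xs):
    """Check X_{n+2} = (2**34 + 6*X_{n+1} - 9*X_n) mod 2**35 for every n."""
    return all(xs[n + 2] == (2 ** 34 + 6 * xs[n + 1] - 9 * xs[n]) % M
               for n in range(len(xs) - 2))


if __name__ == "__main__":
    print(satisfies_three_term_relation(sequence(1, 1000)))
```

The relation depends on $X_n$ being odd, since $2^{34}X_n \bmod 2^{35} = 2^{34}$ only in that case; an even starting value breaks it.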


SECTION 3.2.2 

1. The method is useful only with great caution. In the first place, $aU_n$ is likely to be
so large that the addition of $c/m$ that follows will lose almost all significance, and the
“mod 1” operation will nearly destroy any vestiges of significance that might remain.
We conclude that double-precision floating point arithmetic is necessary. Even with 
double precision, one must be sure that no rounding, etc., occurs to affect the numbers 
of the sequence in any way, since that would destroy the theoretical grounds for the 
good behavior of the sequence. (But see exercise 23.) 

2. $X_{n+1}$ equals either $X_{n-1} + X_n$ or $X_{n-1} + X_n - m$. If $X_{n+1} < X_n$ we must
have $X_{n+1} = X_{n-1} + X_n - m$; hence $X_{n+1} < X_{n-1}$.

3. (a) The underlined numbers are V[j] after step M3.

    Output  initial  0 4 5 6 2 0 3(2 7 4 1 6 3 0 5) and repeats.
    V[0]       0     4 7 7 7 7 7 7 7 4 7 7 7 7 7 7 7 4 7 ...
    V[1]       3     3 3 3 3 3 3 2 5 5 5 5 5 5 5 2 5 5 5 ...
    V[2]       2     2 2 2 2 0 3 3 3 3 3 3 3 0 3 3 3 3 3 ...
    V[3]       5     5 5 6 1 1 1 1 1 1 1 6 1 1 1 1 1 1 1 ...
    X                4 7 6 1 0 3 2 5 4 7 6 1 0 3 2 5 4 7 ...
    Y                0 1 6 7 4 5 2 3 0 1 6 7 4 5 2 3 0 1 ...

So the potency has been reduced to 1! (See further comments in the answer to 
exercise 15.) 
(b) The underlined numbers are V[j] after step B2.

    Output  initial  2 3 6 5 7 0 0 5 3 ... 4 6(3 0 ... 4 7)...
    V[0]       0     0 0 0 0 0 0 5 4 4 ... 1 1 1 1 ... 1 1 ...
    V[1]       3     3 6 1 1 1 1 1 1 1 ... 0 0 0 4 ... 0 0 ...
    V[2]       2     7 7 7 7 3 3 3 3 7 ... 6 2 2 2 ... 7 2 ...
    V[3]       5     5 5 5 0 0 2 2 2 2 ... 3 3 5 5 ... 3 3 ...
    X          4     7 6 1 0 3 2 5 4 7 ... 3 2 5 4 ... 3 2 ...


In this case the output is considerably better than the input; it enters a repeating cycle 
of length 40 after 46 steps: 236570 05314 72632 40110 37564 76025 12541 73625 03746 
(30175 24061 52317 46203 74531 60425 16753 02647). The cycle can be found easily by 
applying the method of exercise 3.1-7 to the above array until a column is repeated. 
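
Both tables in answer 3 can be regenerated by simulation. The sketch below assumes the input sequences are $X_{n+1} = (5X_n + 3) \bmod 8$ and $Y_{n+1} = (5Y_n + 1) \bmod 8$ with $X_0 = Y_0 = 0$ and $k = 4$; these parameters are inferred from the X and Y rows of the tables, not quoted from the exercise, and the step comments follow the step labels M1 to M3 and B1, B2 mentioned above.

```python
from itertools import islice

M, K = 8, 4  # modulus and table size assumed from the tables


def lcg(x, a, c, m=M):
    """Yield x, then successive values of x <- (a*x + c) mod m."""
    while True:
        yield x
        x = (a * x + c) % m


def algorithm_m(xs, ys, k=K, m=M):
    """Randomize xs by shuffling, with table positions chosen by ys."""
    v = [next(xs) for _ in range(k)]       # initial table
    while True:
        x, y = next(xs), next(ys)          # M1: next members of both sequences
        j = k * y // m                     # M2: index determined by y
        yield v[j]                         # M3: output V[j] ...
        v[j] = x                           # ... and replace it by x


def algorithm_b(xs, k=K, m=M):
    """Randomize xs by shuffling, with positions chosen by the previous output."""
    v = [next(xs) for _ in range(k)]       # initial table
    y = next(xs)                           # initial value of Y
    while True:
        j = k * y // m                     # B1: index determined by Y
        y = v[j]                           # B2: output V[j] ...
        v[j] = next(xs)                    # ... and refill the slot
        yield y


out_m = list(islice(algorithm_m(lcg(0, 5, 3), lcg(0, 5, 1)), 23))
out_b = "".join(str(d) for d in islice(algorithm_b(lcg(0, 5, 3)), 126))

if __name__ == "__main__":
    print(out_m)
    print(out_b[:46], out_b[46:86])
```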

4. The low-order byte of many random sequences (e.g., linear congruential sequences
with $m$ = word size) is much less random than the high-order byte. See Section 3.2.1.1.

5. The randomizing effect would be quite minimized, because $V[j]$ would always
contain a number in a certain range, essentially $j/k \le V[j]/m < (j + 1)/k$. However,
some similar approaches could be used: we could take $Y_n = X_{n-1}$, or we could choose $j$
from $X_n$ by extracting some digits from about the middle instead of at the extreme
left. None of these suggestions would produce a lengthening of the period analogous 
to the behavior of Algorithm B. 

6. For example, if $X_n/m < \frac12$, then $X_{n+1} = 2X_n$.

7. [W. Mantel, Nieuw Archief voor Wiskunde (2) 1 (1897), 172-184.]

                         00...01                  00...01
    The subsequence of   00...10                  00...10
    X values:              ...        becomes:      ...
                         10...00                  10...00
                         CONTENTS(A)              00...00
                                                  CONTENTS(A)

8. We may assume that $X_0 = 0$ and $m = p^e$, as in the proof of Theorem 3.2.1.2A.
First suppose that the sequence has period length $p^e$; it follows that the period of
the sequence mod $p^f$ has length $p^f$, for $1 \le f \le e$, otherwise some residues mod $p^f$
would never occur. Clearly, $c$ is not a multiple of $p$, for otherwise each $X_n$ would
be a multiple of $p$. If $p \le 3$, it is easy to establish the necessity of conditions (iii)
and (iv) by trial and error, so we may assume that $p \ge 5$. If $d \not\equiv 0$ (modulo $p$) then
$dx^2 + ax + c \equiv d(x + a_1)^2 + c_1$ (modulo $p^e$) for some integers $a_1$ and $c_1$ and for all
integers $x$; this quadratic takes the same value at the points $x$ and $-x - 2a_1$, so it
cannot assume all values modulo $p^e$. Hence $d \equiv 0$ (modulo $p$); and if $a \not\equiv 1$, we would
have $dx^2 + ax + c \equiv x$ (modulo $p$) for some $x$, contradicting the fact that the sequence
mod $p$ has period length $p$.

To show the sufficiency of the conditions, we may assume by Theorem 3.2.1.2A
and consideration of some trivial cases that $m = p^e$ where $e \ge 2$. If $p = 2$, we have
$X_{n+p} \equiv X_n + pc$ (modulo $p^2$), by trial; and if $p = 3$, we have either $X_{n+p} \equiv X_n + pc$
(modulo $p^2$), for all $n$, or $X_{n+p} \equiv X_n - pc$ (modulo $p^2$), for all $n$. For $p \ge 5$, we
can prove that $X_{n+p} \equiv X_n + pc$ (modulo $p^2$): Let $d = pr$, $a = 1 + ps$. Then if
$X_n \equiv cn + pY_n$ (modulo $p^2$), we must have $Y_{n+1} \equiv n^2c^2r + ncs + Y_n$ (modulo $p$);
hence $Y_n \equiv \binom{n}{3}2c^2r + \binom{n}{2}(c^2r + cs)$ (modulo $p$). Thus $Y_p \bmod p = 0$, and the desired
relation has been proved.

Now we can prove that the sequence $\langle X_n \rangle$ of integers defined in the “hint” satisfies
the relation

$$X_{n+p^f} \equiv X_n + tp^f \pmod{p^{f+1}}, \qquad n \ge 0,$$

for some $t$ with $t \bmod p \ne 0$, and for all $f \ge 1$. This suffices to prove that the sequence
$\langle X_n \bmod p^e \rangle$ has period length $p^e$, for the length of the period is a divisor of $p^e$ but not
a divisor of $p^{e-1}$. The above relation has already been established for $f = 1$, and for
$f > 1$ it can be proved by induction in the following manner: Let

$$X_{n+p^f} \equiv X_n + tp^f + Z_np^{f+1} \pmod{p^{f+2}};$$

then the quadratic law for generating the sequence, with $d = pr$, $a = 1 + ps$, yields
$Z_{n+1} \equiv 2rtnc + st + Z_n$ (modulo $p$). It follows that $Z_{n+p} \equiv Z_n$ (modulo $p$); hence

$$X_{n+kp^f} \equiv X_n + k(tp^f + Z_np^{f+1}) \pmod{p^{f+2}}$$

for $k = 1, 2, 3, \ldots$; setting $k = p$ completes the proof.

Notes: If $f(x)$ is a polynomial of degree higher than 2 and $X_{n+1} = f(X_n)$,
the analysis is more complicated, although we can use the fact that $f(m + p^k) =
f(m) + p^kf'(m) + p^{2k}f''(m)/2! + \cdots$ to prove that many polynomial recurrences give
the maximum period. For example, Coveyou has proved that the period is $m = 2^e$ if
$f(0)$ is odd, $f'(j) \equiv 1$, $f''(j) \equiv 0$, and $f(j + 1) \equiv f(j) + 1$ (modulo 4) for $j = 0, 1, 2, 3$.
[Studies in Applied Math. 3 (1967), 70-111.]

9. Let $X_n = 4Y_n + 2$; then the sequence $Y_n$ satisfies the quadratic recurrence $Y_{n+1} =
(4Y_n^2 + 5Y_n + 1) \bmod 2^{e-2}$.

10. Case 1: $X_0 = 0$, $X_1 = 1$; hence $X_n \equiv F_n$. We seek the smallest $n$ for which $F_n \equiv 0$
and $F_{n+1} \equiv 1$ (modulo $2^e$). Since $F_{2n} = F_n(F_{n-1} + F_{n+1})$, $F_{2n+1} = F_n^2 + F_{n+1}^2$,
we find by induction on $e$ that, for $e > 1$, $F_{3 \cdot 2^{e-1}} \equiv 0$ and $F_{3 \cdot 2^{e-1}+1} \equiv 2^e + 1$
(modulo $2^{e+1}$). This implies that the period is a divisor of $3 \cdot 2^{e-1}$ but not a divisor
of $3 \cdot 2^{e-2}$, so it is either $3 \cdot 2^{e-1}$ or $2^{e-1}$. But $F_{2^{e-1}}$ is always odd (since only $F_{3n}$ is
even).

Case 2: $X_0 = a$, $X_1 = b$. Then $X_n \equiv aF_{n-1} + bF_n$; we need to find the smallest
positive $n$ with $a(F_{n+1} - F_n) + bF_n \equiv a$ and $aF_n + bF_{n+1} \equiv b$. This implies that
$(b^2 - ab - a^2)F_n \equiv 0$, $(b^2 - ab - a^2)(F_{n+1} - 1) \equiv 0$; and $b^2 - ab - a^2$ is odd (i.e.,
prime to $m$) so the condition is equivalent to $F_n \equiv 0$, $F_{n+1} \equiv 1$.

Methods to determine the period of $F_n$ for any modulus appear in an article by
D. D. Wall, AMM 67 (1960), 525-532. Further facts about the Fibonacci sequence
mod $2^e$ have been derived by B. Jansson [Random Number Generators (Stockholm:
Almqvist & Wiksell, 1966), Section 3C1].
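
The period claimed in Case 1 of answer 10, and the congruences used in its induction, can be checked directly for small $e$. The following sketch is only a numerical sanity check under the definitions above; the function names are illustrative.

```python
def fib_period(m):
    """Least n > 0 with (F_n, F_{n+1}) ≡ (0, 1) modulo m, for m >= 2."""
    a, b, n = 0, 1, 0
    while True:
        a, b = b, (a + b) % m
        n += 1
        if (a, b) == (0, 1):
            return n


def fib_pair(n, m):
    """Return (F_n mod m, F_{n+1} mod m) by straightforward iteration."""
    a, b = 0, 1 % m
    for _ in range(n):
        a, b = b, (a + b) % m
    return a, b


if __name__ == "__main__":
    for e in range(1, 8):
        print(e, fib_period(2 ** e), 3 * 2 ** (e - 1))
```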

11. (a) We have $z^\lambda = 1 + f(z)u(z) + p^ev(z)$ for some $u(z)$, $v(z)$, where $v(z) \not\equiv 0$
(modulo $f(z)$ and $p$). By the binomial theorem

$$z^{\lambda p} \equiv 1 + p^{e+1}v(z) + p^{2e+1}v(z)^2(p - 1)/2$$

plus further terms congruent to zero (modulo $f(z)$ and $p^{e+2}$). Since $p^e > 2$, we have
$z^{\lambda p} \equiv 1 + p^{e+1}v(z)$ (modulo $f(z)$ and $p^{e+2}$). If $p^{e+1}v(z) \equiv 0$ (modulo $f(z)$ and $p^{e+2}$),
there must exist polynomials $a(z)$ and $b(z)$ such that $p^{e+1}(v(z) + pa(z)) = f(z)b(z)$.
Since $f(0) = 1$, this implies that $b(z)$ is a multiple of $p^{e+1}$ (by Gauss's Lemma 4.6.1G);
hence $v(z) \equiv 0$ (modulo $f(z)$ and $p$), a contradiction.

(b) If $z^\lambda - 1 = f(z)u(z) + p^ev(z)$, then

$$G(z) = u(z)/(z^\lambda - 1) + p^ev(z)/f(z)(z^\lambda - 1);$$

hence $A_{n+\lambda} \equiv A_n$ (modulo $p^e$) for large $n$. Conversely, if $\langle A_n \rangle$ has the latter property
then $G(z) = u(z) + v(z)/(1 - z^\lambda) + p^eH(z)$, for some polynomials $u(z)$ and $v(z)$, and
some power series $H(z)$, all with integer coefficients. This implies the identity $1 - z^\lambda =
u(z)f(z)(1 - z^\lambda) + v(z)f(z) + p^eH(z)f(z)(1 - z^\lambda)$; and $H(z)f(z)(1 - z^\lambda)$ is a polynomial
since the other terms of the equation are polynomials.

(c) It suffices to prove that $\lambda(p^e) \ne \lambda(p^{e+1})$ implies that $\lambda(p^{e+1}) = p\lambda(p^e) \ne
\lambda(p^{e+2})$. Applying (a) and (b), we know that $\lambda(p^{e+2}) \ne p\lambda(p^e)$, and that $\lambda(p^{e+1})$
is a divisor of $p\lambda(p^e)$ but not of $\lambda(p^e)$. Hence if $\lambda(p^e) = p^fq$, where $q \bmod p \ne 0$,
then $\lambda(p^{e+1})$ must be $p^{f+1}d$, where $d$ is a divisor of $q$. But now $X_{n+p^{f+1}d} \equiv X_n$
(modulo $p^e$); hence $p^{f+1}d$ is a multiple of $p^fq$, hence $d = q$. [Note: The hypothesis
$p^e > 2$ is necessary; for example, let $a_1 = 4$, $a_2 = -1$, $k = 2$; then $\langle A_n \rangle = 1$, 4, 15,
56, 209, 780, ...; $\lambda(2) = 2$, $\lambda(4) = 4$, $\lambda(8) = 4$.]
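
The counterexample in the note to part (c) can be confirmed by computing the periods directly. The sketch below treats $\langle A_n \rangle$ as the coefficient sequence of $1/f(z)$, that is, $A_n = a_1A_{n-1} + \cdots + a_kA_{n-k}$ with $A_0 = 1$ and $A_n = 0$ for $n < 0$; it assumes $a_k$ is invertible modulo $m$, so that the sequence is purely periodic. The function name `recurrence_period` is illustrative.

```python
def recurrence_period(coeffs, m):
    """Period modulo m of A_n = a_1*A_{n-1} + ... + a_k*A_{n-k}, with A_0 = 1.

    `coeffs` is (a_1, ..., a_k).  The state holds the k most recent terms,
    oldest first, starting from (A_{1-k}, ..., A_{-1}, A_0) = (0, ..., 0, 1).
    """
    k = len(coeffs)
    start = (0,) * (k - 1) + (1,)
    state, n = start, 0
    while True:
        nxt = sum(a * x for a, x in zip(coeffs, reversed(state))) % m
        state = state[1:] + (nxt,)
        n += 1
        if state == start:
            return n
        if n > m ** k:  # more steps than distinct states: never returns
            raise ValueError("sequence is not purely periodic modulo m")


if __name__ == "__main__":
    print([recurrence_period((4, -1), 2 ** e) for e in (1, 2, 3)])
```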

(d) $g(z) = X_0 + (X_1 - a_1X_0)z + \cdots + (X_{k-1} - a_1X_{k-2} - a_2X_{k-3} - \cdots - a_{k-1}X_0)z^{k-1}$.

(e) The derivation in (b) can be generalized to the case $G(z) = g(z)/f(z)$; then
the assumption of period length $\lambda$ implies that $g(z)(1 - z^\lambda) \equiv 0$ (modulo $f(z)$ and $p^e$);
we treated only the special case $g(z) = 1$ above. But both sides of this congruence can
be multiplied by Hensel's $b(z)$, and we obtain $1 - z^\lambda \equiv 0$ (modulo $f(z)$ and $p^e$).

Note: A more “elementary” proof of the result in (c) can be given without using
generating functions, using methods analogous to those in the answer to exercise 8: If
$A_{\lambda+n} = A_n + p^eB_n$, for $n = r, r + 1, \ldots, r + k - 1$ and some integers $B_n$, then this
same relation holds for all $n \ge r$ if we define $B_{r+k}, B_{r+k+1}, \ldots$ by the given recurrence
relation. Since the resulting sequence of $B$'s is some linear combination of shifts of the
sequence of $A$'s, we will have $B_{\lambda+n} \equiv B_n$ (modulo $p^e$) for all large enough values
of $n$. Now $\lambda(p^{e+1})$ must be some multiple of $\lambda = \lambda(p^e)$; for all large enough $n$ we have

$$A_{n+j\lambda} = A_n + p^e(B_n + B_{n+\lambda} + B_{n+2\lambda} + \cdots + B_{n+(j-1)\lambda}) \equiv A_n + jp^eB_n \pmod{p^{2e}}$$

for $j = 1, 2, 3, \ldots$. No $k$ consecutive $B$'s are multiples of $p$; hence $\lambda(p^{e+1}) = p\lambda(p^e) \ne
\lambda(p^{e+2})$ follows immediately when $e \ge 2$. We still must prove that $\lambda(p^{e+2}) \ne p\lambda(p^e)$
when $p$ is odd and $e = 1$; here we let $B_{\lambda+n} = B_n + pC_n$, and observe that $C_{n+\lambda} \equiv C_n$
(modulo $p$) when $n$ is large enough. Then $A_{n+p\lambda} \equiv A_n + p^2(B_n + \binom{p}{2}C_n)$ (modulo $p^3$),
and the proof is readily completed.

For the history of this problem, see Morgan Ward, Trans. Amer. Math. Soc. 35 
(1933), 600-628; see also D. W. Robinson, AMM 73 (1966), 619-621. 

12. The period length mod 2 can be at most 4; and the period length mod $2^{e+1}$ is at
most twice the maximum length mod $2^e$, by the considerations of the previous exercise.
So the maximum conceivable period length is $2^{e+1}$; this is achievable, for example, in
the trivial case $a = 0$, $b = c = 1$.

13, 14. Clearly $Z_{n+\lambda} = Z_n$, so $\lambda'$ is certainly a divisor of $\lambda$. Let the least common
multiple of $\lambda'$ and $\lambda_1$ be $\lambda'_1$, and define $\lambda'_2$ similarly. We have $X_n + Y_n \equiv Z_n \equiv Z_{n+\lambda'_1} \equiv
X_n + Y_{n+\lambda'_1}$, so $\lambda'_1$ is a multiple of $\lambda_2$. Similarly, $\lambda'_2$ is a multiple of $\lambda_1$. This yields
the desired result. (The result is “best possible” in the sense that sequences for which
$\lambda' = \lambda_0$ can be constructed, as well as sequences for which $\lambda' = \lambda$.)
15. Algorithm M generates $(X_{n+k}, Y_n)$ in step M1 and outputs $Z_n = X_{n+k-q_n}$ in
step M3, for all sufficiently large $n$. Thus $\langle Z_n \rangle$ has a period of length $\lambda'$, where $\lambda'$ is
the least positive integer such that $X_{n+k-q_n} = X_{n+\lambda'+k-q_{n+\lambda'}}$ for all large $n$. Since
$\lambda$ is a multiple of $\lambda_1$ and $\lambda_2$, it follows that $\lambda'$ is a divisor of $\lambda$. (These observations
are due to Alan G. Waterman.)

We also have $n + k - q_n \equiv n + \lambda' + k - q_{n+\lambda'}$ (modulo $\lambda_1$) for all large $n$,
by the distinctness of the $X$'s. The bound on $\langle q_n \rangle$ implies that $q_{n+\lambda'} = q_n + c$ for
all large $n$, where $c \equiv \lambda'$ (modulo $\lambda_1$) and $|c| < \frac12\lambda_1$. But $c$ must be 0 since $\langle q_n \rangle$ is
bounded. Hence $\lambda' \equiv 0$ (modulo $\lambda_1$), and $q_{n+\lambda'} = q_n$ for all large $n$; it follows that $\lambda'$
is a multiple of $\lambda_2$ and $\lambda_1$, so $\lambda' = \lambda$.

Note: The answer to exercise 3.2.1.2-4 implies that when $\langle Y_n \rangle$ is a linear congruential
sequence of maximum period modulo $m = 2^e$, the period length $\lambda_2$ will be at most
$2^{e-2}$ when $k$ is a power of 2.

16. There are several methods of proof. 

(1) Using the theory of finite fields. In the field with $2^k$ elements let $\xi$ satisfy
$\xi^k = a_1\xi^{k-1} + \cdots + a_k$. Let $f(b_1\xi^{k-1} + \cdots + b_k) = b_k$, where each $b_j$ is either zero
or one; this is a linear function. If word X in the generation algorithm is $(b_1b_2 \ldots b_k)_2$
before (10) is executed, and if $b_1\xi^{k-1} + \cdots + b_k\xi^0 = \xi^n$, then word X represents
$\xi^{n+1}$ after (10) is executed. Hence the sequence is $f(\xi^n)$, $f(\xi^{n+1})$, $f(\xi^{n+2})$, ...; and

$$f(\xi^{n+k}) = f(\xi^n\xi^k) = f(a_1\xi^{n+k-1} + \cdots + a_k\xi^n) = a_1f(\xi^{n+k-1}) + \cdots + a_kf(\xi^n).$$

(2) Using brute force, or elementary ingenuity. We are given a sequence $X_{nj}$,
$n \ge 0$, $1 \le j \le k$, satisfying

$$X_{(n+1)j} \equiv X_{n(j+1)} + a_jX_{n1}, \quad 1 \le j < k; \qquad X_{(n+1)k} \equiv a_kX_{n1} \pmod 2.$$

We must show that this implies $X_{nk} \equiv a_1X_{(n-1)k} + \cdots + a_kX_{(n-k)k}$, for $n \ge k$.
Indeed, it implies $X_{nj} \equiv a_1X_{(n-1)j} + \cdots + a_kX_{(n-k)j}$ when $1 \le j \le k \le n$. This
is clear for $j = 1$, since $X_{n1} \equiv a_1X_{(n-1)1} + X_{(n-1)2} \equiv a_1X_{(n-1)1} + a_2X_{(n-2)1} +
X_{(n-2)3}$, etc. For $j > 1$, we have by induction

$$X_{nj} \equiv X_{(n+1)(j-1)} - a_{j-1}X_{n1}$$

$$\equiv \sum_{1\le i\le k} a_iX_{(n+1-i)(j-1)} - a_{j-1}\sum_{1\le i\le k} a_iX_{(n-i)1}$$

$$\equiv \sum_{1\le i\le k} a_i(X_{(n+1-i)(j-1)} - a_{j-1}X_{(n-i)1})$$

$$\equiv a_1X_{(n-1)j} + \cdots + a_kX_{(n-k)j}.$$

This proof does not depend on the fact that operations were done modulo 2, or modulo
any prime number.

17. (a) When the sequence terminates, the $(k - 1)$-tuple $(X_{n+1}, \ldots, X_{n+k-1})$ occurs
for the $(m + 1)$st time. A given $(k - 1)$-tuple $(X_{r+1}, \ldots, X_{r+k-1})$ can have only $m$
distinct predecessors $X_r$, so one of these occurrences must be for $r = 0$. (b) Since
the $(k - 1)$-tuple $(0, \ldots, 0)$ occurs $(m + 1)$ times, each possible predecessor appears,
so the $k$-tuple $(a_1, 0, \ldots, 0)$ appears for all $a_1$, $0 \le a_1 < m$. Let $1 \le s < k$ and
suppose we have proved that all $k$-tuples $(a_1, \ldots, a_s, 0, \ldots, 0)$ appear in the sequence
when $a_s \ne 0$. By the construction, this $k$-tuple would not be in the sequence unless
$(a_1, \ldots, a_s, 0, \ldots, 0, y)$ had appeared earlier for $1 \le y < m$. Hence the $(k - 1)$-tuple
$(a_1, \ldots, a_s, 0, \ldots, 0)$ has appeared $m$ times, and all $m$ possible predecessors appear; this
means $(a, a_1, \ldots, a_s, 0, \ldots, 0)$ appears for $0 \le a < m$. The proof is now complete by
induction.

The result also follows from Theorem 2.3.4.2D, using the directed graph of exercise
2.3.4.2-23; the set of arcs from $(x_1, \ldots, x_j, 0, \ldots, 0)$ to $(x_2, \ldots, x_j, 0, \ldots, 0, 0)$, where
$x_j \ne 0$ and $1 \le j \le k$, forms an oriented subtree related neatly to Dewey decimal
notation.

18. The third-most-significant bit of $U_{n+1}$ is completely determined by the first and
third bits of $U_n$, so only 32 of the 64 possible pairs $(\lfloor 8U_n \rfloor, \lfloor 8U_{n+1} \rfloor)$ occur. (Notes: If we
had used, say, 11-bit numbers $U_n = (.X_{11n}X_{11n+1} \ldots X_{11n+10})_2$, the sequence would
be satisfactory for many applications. If another constant appears in A having more
“one” bits, the generalized spectral test might give some indication of its suitability.
See exercise 3.3.4-24; we could examine $\nu_t$ in dimensions $t = 36, 37, 38, \ldots$.)

21. [J. London Math. Soc. 21 (1946), 169-172.] Any sequence of period length $m^k - 1$
with no $k$ consecutive zeros leads to a sequence of period length $m^k$ by inserting a zero
in the appropriate place, as in exercise 7; conversely, we can start with a sequence of
period length $m^k$ and delete an appropriate zero from the period, to form a sequence of
the other type. Let us call these “$(m, k)$ sequences” of types A and B. The hypothesis
assures us of the existence of $(p, k)$ sequences of type A, for all primes $p$ and all $k \ge 1$;
hence we have $(p, k)$ sequences of type B for all such $p$ and $k$.

To get a $(p^e, k)$ sequence of type B, let $e = qr$, where $q$ is a power of $p$ and $r$ is not
a multiple of $p$. Start with a $(p, qrk)$ sequence of type A, namely $X_0, X_1, X_2, \ldots$; then
(using the $p$-ary number system) the grouped digits $(X_0 \ldots X_{q-1})_p$, $(X_q \ldots X_{2q-1})_p$, ...
form a $(p^q, rk)$ sequence of type A, since $q$ is relatively prime to $p^{qrk} - 1$ and the
sequence therefore has a period length of $p^{qrk} - 1$. This leads to a $(p^q, rk)$ sequence
$\langle Y_n \rangle$ of type B; and $(Y_0Y_1 \ldots Y_{r-1})_{p^q}$, $(Y_rY_{r+1} \ldots Y_{2r-1})_{p^q}$, ... is a $(p^{qr}, k)$ sequence of
type B by a similar argument, since $r$ is relatively prime to $p^{qk}$.

To get an $(m, k)$ sequence of type B for arbitrary $m$, we can combine $(p^e, k)$
sequences for each of the prime power factors of $m$ using the Chinese remainder
theorem; but a simpler method is available. Let $\langle X_n \rangle$ be an $(r, k)$ sequence of type B,
and let $\langle Y_n \rangle$ be an $(s, k)$ sequence of type B, where $r$ and $s$ are relatively prime; then
$\langle sX_n + Y_n \rangle$ is an $(rs, k)$ sequence of type B.

A simple, uniform construction that yields (2, k) sequences for arbitrary k has been 
discovered by A. Lempel [IEEE Trans. C-19 (1970), 1204-1209]. 

22. By the Chinese remainder theorem, we can find constants $a_1, \ldots, a_k$ having desired
residues mod each prime divisor of $m$. If $m = p_1p_2 \ldots p_t$, the period length will be
$\mathrm{lcm}(p_1^k - 1, \ldots, p_t^k - 1)$. In fact, we can achieve reasonably long periods for arbitrary $m$
(not necessarily squarefree), as shown in exercise 11.

23. Period length at least $2^{55} - 1$; possibly faster than (7), see exercise 3.2.1.1-5.
Furthermore, R. Brent has pointed out that the calculations can be done exactly on 
floating point numbers in [0,1). 

24. Run the sequence backwards. In other words, if $Z_n = Y_{-n}$ we have $Z_n =
(Z_{n-m+k} + Z_{n-m}) \bmod 2$.

25. This actually would be slower and more complicated, unless it can be used to save 
subroutine-calling overhead in high-level languages. (See the FORTRAN program in 
Section 3.6.) 
27. Let $J_n = \lfloor kX_n/m \rfloor$. Lemma: After the $(k^2 + 7k - 2)/2$ consecutive values

$$0^{k+2}\ 1\ 0^{k+1}\ 2\ 0^k\ \ldots\ (k - 1)\ 0^3$$

occur in the $\langle J_n \rangle$ sequence, Algorithm B will have $V[j] < m/k$ for $0 \le j < k$, and
also $Y < m/k$. Proof. Let $S_n$ be the set of positions $i$ such that $V[i] < m/k$ just
before $X_n$ is generated, and let $j_n$ be the index such that $V[j_n] \leftarrow X_n$. If $j_n \notin S_n$
and $J_n = 0$, then $S_{n+1} = S_n \cup \{j_n\}$; if $j_n \in S_n$ and $J_n = 0$, then $S_{n+1} = S_n$ and
$j_{n+1} = 0$. After $k + 2$ successive 0's, we must therefore have $0 \in S_n$ and $j_{n+1} = 0$.
Then after “$1\ 0^{k+1}$” we must have $\{0, 1\} \subseteq S_n$ and $j_{n+1} = 0$; after “$2\ 0^k$” we must
have $\{0, 1, 2\} \subseteq S_n$ and $j_{n+1} = 0$; and so on.

Corollary: For $\lambda \ge 2(k^2 + 7k - 2)k^{(k^2+7k-2)/2}$, either Algorithm B yields a period
of length $\lambda$ or the sequence $\langle X_n \rangle$ is poorly distributed. Proof. The probability that
any given length-$l$ pattern of $J$'s does not occur in a random sequence of length $\lambda$ is
less than $(1 - k^{-l})^{\lambda/l} < \exp(-k^{-l}\lambda/l)$. For $l = (k^2 + 7k - 2)/2$ this is at most
$e^{-4}$; hence the stated pattern should appear. After it does, the subsequent behavior of
Algorithm B will be the same each time it reaches this part of the period.


SECTION 3.3.1 

1. There are $k = 11$ categories, so the line $\nu = 10$ should be used.

2. 2/49, 3/49, 4/49, 5/49, 6/49, 9/49, 6/49, 5/49, 4/49, 3/49, 2/49.

3. $V = 7\frac{173}{240}$, only very slightly higher than that obtained from the good dice! There
are two reasons why we do not detect the weighting: (a) The new probabilities (cf. 
exercise 2) are not really very far from the old ones in Eq. (1). The sum of the two dice
tends to smooth out the probabilities; if we considered instead each of the 36 possible 
pairs of values, and counted these, we would probably detect the difference quite rapidly 
(assuming that the two dice are distinguishable), (b) A far more important reason is 
that n is too small for a significant difference to be detected. If the same experiment 
is done for large enough $n$, the faulty dice will be discovered (see exercise 12).

4. $p_s = \frac{1}{12}$ for $2 \le s \le 12$ and $s \ne 7$; $p_7 = \frac16$. The value of $V$ is $16\frac12$, which falls
between the 75% and 95% entries in Table 1; so it is reasonable, in spite of the fact
that not too many sevens actually turned up.
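
The probabilities in answers 2 and 4 follow from convolving the two dice. The sketch below uses exact fractions. The loading models are assumptions about what the exercises describe: for exercise 2, one die shows 1 twice as often as any other face and the other die similarly favors 6; for exercise 4, one die is fair and the other always shows 1 or 6 with equal probability. The helper names are illustrative.

```python
from fractions import Fraction


def weighted_die(weights):
    """Turn {face: relative weight} into {face: exact probability}."""
    total = sum(weights.values())
    return {face: Fraction(w, total) for face, w in weights.items()}


def sum_distribution(die_a, die_b):
    """Exact distribution of the total shown by two independent dice."""
    dist = {}
    for x, px in die_a.items():
        for y, py in die_b.items():
            dist[x + y] = dist.get(x + y, 0) + px * py
    return dist


fair = weighted_die({f: 1 for f in range(1, 7)})
favors_1 = weighted_die({f: 2 if f == 1 else 1 for f in range(1, 7)})  # assumed loading
favors_6 = weighted_die({f: 2 if f == 6 else 1 for f in range(1, 7)})  # assumed loading
one_or_six = weighted_die({1: 1, 6: 1})                                # assumed loading

exercise_2 = sum_distribution(favors_1, favors_6)
exercise_4 = sum_distribution(fair, one_or_six)

if __name__ == "__main__":
    print([str(exercise_2[s]) for s in range(2, 13)])
    print([str(exercise_4[s]) for s in range(2, 13)])
```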

5. $K^+_{20} = 1.15$; $K^-_{20} = 0.215$; these do not differ significantly from random behavior
(being at about the 94% and 86 % levels), but they are mighty close. (The data values 
in this exercise come from Appendix A, Table 1.) 

6. The probability that $X_j \le x$ is $F(x)$, so we have the binomial distribution
discussed in Section 1.2.10. $F_n(x) = s/n$ with probability $\binom{n}{s}F(x)^s(1 - F(x))^{n-s}$;
the mean is $F(x)$; the standard deviation is $\sqrt{F(x)(1 - F(x))/n}$. [Cf. Eq. 1.2.10-19.
This suggests that a slightly better statistic would be to define

$$K^+_n = \sqrt{n} \max_{-\infty < x < \infty} (F_n(x) - F(x))/\sqrt{F(x)(1 - F(x))};$$

see exercise 22. We can calculate the mean and standard deviation of $F_n(x) - F_n(y)$,
for $x < y$, and obtain the covariance of $F_n(x)$, $F_n(y)$. Using these facts, it can be
shown that for large values of $n$ the function $F_n(x)$ behaves as a “Brownian motion,”
and techniques from this branch of probability theory may be used to study it. The
situation is exploited in articles by J. L. Doob and M. D. Donsker, Annals Math. Stat.
20 (1949), 393-403 and 23 (1952), 277-281; this is generally regarded as the most
enlightening way to study the KS tests.]

7. (Cf. Eq. (13).) Take $j = n$ to see that $K^+_n$ is never negative and it can get as
high as $\sqrt{n}$. Similarly, take $j = 1$ to make the same observations about $K^-_n$.

8. The new KS statistic was computed for 20 observations. The distribution of K/q 
was used as F(x) when the KS statistic was computed. 

9. The idea is erroneous, because all of the observations must be independent. There 
is a relation between the statistics $K^+_n$ and $K^-_n$ on the same data, so each test should be
judged separately. (A high value of one tends to give a low value of the other.) Similarly, 
the entries in Figs. 2 and 5 (which show 15 tests for each generator) do not show 15 
independent observations, because the maximum-of-5 test is not independent of the 
maximum-of-4 test. The three tests of each horizontal row are independent (because 
they were done on different parts of the sequence), but the five tests in a column are 
somewhat correlated. The net effect of this is that the 95-percent probability levels, 
etc., which apply to one test, cannot legitimately be applied to a whole group of tests 
on the same data. Moral: When testing a random number generator, we may expect 
it to “pass” each of several tests, e.g., the frequency test, maximum test, run test; but 
an array of data from several different tests should not be considered as a unit since 
the tests themselves may not be independent. The $K^+_n$ and $K^-_n$ statistics should be
considered as two separate tests; a good source of random numbers will pass both of 
the tests. 

10. Each $Y_s$ is doubled, and $np_s$ is doubled, so the numerators of (6) are quadrupled
while the denominators only double. Hence the new value of V is twice as high as the 
old one. 

11. The empirical distribution function stays the same; the values of $K^+_n$ and $K^-_n$ are
multiplied by $\sqrt2$.
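
The effects described in answers 10 and 11 can be seen on a small example. The sketch below uses the chi-square statistic $V = \sum (Y_s - np_s)^2/np_s$ and the sorted-sample form of the KS statistics; the counts and sample values are arbitrary illustrative numbers, not data from the text.

```python
import math


def chi_square(counts, probs):
    """V = sum over categories of (Y_s - n p_s)^2 / (n p_s)."""
    n = sum(counts)
    return sum((y - n * p) ** 2 / (n * p) for y, p in zip(counts, probs))


def ks_statistics(values, cdf):
    """Return (K+, K-) of the sample `values` against the distribution `cdf`."""
    xs = sorted(values)
    n = len(xs)
    k_plus = math.sqrt(n) * max((j + 1) / n - cdf(x) for j, x in enumerate(xs))
    k_minus = math.sqrt(n) * max(cdf(x) - j / n for j, x in enumerate(xs))
    return k_plus, k_minus


counts = [3, 7, 5, 9]                  # illustrative counts only
probs = [0.25] * 4
sample = [0.1, 0.35, 0.4, 0.8, 0.95]   # illustrative sample only


def uniform_cdf(x):
    return x


if __name__ == "__main__":
    print(chi_square(counts, probs), chi_square([2 * c for c in counts], probs))
    print(ks_statistics(sample, uniform_cdf), ks_statistics(sample * 2, uniform_cdf))
```

Duplicating every observation leaves the empirical distribution unchanged, so only the factor $\sqrt{n}$ changes in the KS statistics, while each term of $V$ doubles.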

12. Let $Z_s = (Y_s - nq_s)/\sqrt{nq_s}$. The value of $V$ is $n$ times

$$\sum_{1 \le s \le k} (q_s - p_s + \sqrt{q_s/n}\,Z_s)^2/p_s,$$

and the latter quantity stays bounded away from zero as $n$ increases (since $Z_sn^{-1/4}$
is bounded with probability 1). Hence the value of $V$ will increase to a value that is
extremely improbable under the $p_s$ assumption.

For the KS test, let $F(x)$ be the assumed distribution, $G(x)$ the actual distribution,
and let $h = \max|G(x) - F(x)|$. Take $n$ large enough so that $|F_n(x) - G(x)| > h/2$
occurs with very small probability; then $|F_n(x) - F(x)|$ will be improbably high under
the assumed distribution $F(x)$.

13. (The “max” notation should really be replaced by “sup” since a least upper bound
is meant; however, “max” was used in the text to avoid confusing too many readers
by the less familiar “sup” notation.) For convenience, let $X_0 = -\infty$, $X_{n+1} = +\infty$.
When $X_j \le x < X_{j+1}$, we have $F_n(x) = j/n$; therefore $\max(F_n(x) - F(x)) =
j/n - F(X_j)$ and $\max(F(x) - F_n(x)) = F(X_{j+1}) - j/n$ in this interval. As $j$ varies
from 0 to $n$, all real values of $x$ are considered; this proves that

$$K^+_n = \sqrt{n} \max_{0 \le j \le n} \Bigl(\frac{j}{n} - F(X_j)\Bigr);$$

$$K^-_n = \sqrt{n} \max_{1 \le j \le n+1} \Bigl(F(X_j) - \frac{j-1}{n}\Bigr).$$

These are equivalent to (13), since the extra term under the maximum signs is
nonpositive and it must be redundant by exercise 7.

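14. The logarithm of the left-hand side simplifies to

$$-\sum_{1 \le s \le k} Y_s \ln\Bigl(1 + \frac{Z_s}{\sqrt{np_s}}\Bigr) + \frac{1-k}{2}\ln(2\pi n) - \frac12 \sum_{1 \le s \le k} \ln p_s - \frac12 \sum_{1 \le s \le k} \ln\Bigl(1 + \frac{Z_s}{\sqrt{np_s}}\Bigr) + O\Bigl(\frac1n\Bigr),$$

and this quantity simplifies further (upon expanding $\ln(1 + Z_s/\sqrt{np_s})$ and realizing
that $\sum_{1 \le s \le k} Z_s\sqrt{np_s} = 0$) to

$$-\frac12 \sum_{1 \le s \le k} Z_s^2 + \frac{1-k}{2}\ln(2\pi n) - \frac12 \ln(p_1 \ldots p_k) + O\Bigl(\frac{1}{\sqrt{n}}\Bigr).$$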


15. The corresponding Jacobian determinant is easily evaluated by (a) removing the
factor $r^{n-1}$ from the determinant, (b) expanding the resulting determinant by the
cofactors of the row containing “$\cos\theta_1\ \ {-\sin\theta_1}\ \ 0\ \ldots\ 0$” (each of the cofactor
determinants may be evaluated by induction), and (c) recalling that $\sin^2\theta_1 + \cos^2\theta_1 = 1$.


16. 



zV2x+y 

exp 





The latter integral is 


L 


■,\/2x 


e~ u2/2x du + 


J-f 

3x 2 J o 


;%/2i 


>r u / 2x u 3 


When all is put together, the final result is 


du-\~0 



l{x + 1, s + z\[2x -f- y) 

T(xH-l) 



If we set $z\sqrt2 = x_p$ and write



where $t/2 = x + z\sqrt{2x} + y$, we can solve for $y$ to obtain $y = \frac23(1 + z^2) + O(1/\sqrt{x})$,
which is consistent with the above analysis. The solution is therefore $t = \nu + 2\sqrt{\nu}\,z +
\frac43z^2 - \frac23 + O(1/\sqrt{\nu})$.





The numerator in (24) is P n y t \{n). 

18. We may assume that $F(x) = x$ for $0 \le x \le 1$, as remarked in the text's derivation
of (24). If $0 \le X_1 \le \cdots \le X_n \le 1$, let $Z_j = 1 - X_{n+1-j}$. We have $0 \le Z_1 \le \cdots \le
Z_n \le 1$; and $K^+_n$ evaluated for $X_1, \ldots, X_n$ equals $K^-_n$ evaluated for $Z_1, \ldots, Z_n$. This
symmetrical relation gives a one-to-one correspondence between sets of equal volume
for which $K^+_n$ and $K^-_n$ fall in a given range.

23. Let $m$ be any number $\ge n$. (a) If $\lfloor mF(X_i)\rfloor = \lfloor mF(X_j)\rfloor$ and $i > j$, then $i/n - F(X_i) > j/n - F(X_j)$. (b) Start with $a_k = 1.0$, $b_k = 0.0$, and $c_k = 0$ for $0 \le k < m$. Then do the following for each observation $X_j$: Set $Y \leftarrow F(X_j)$, $k \leftarrow \lfloor mY\rfloor$, $a_k \leftarrow \min(a_k, Y)$, $b_k \leftarrow \max(b_k, Y)$, $c_k \leftarrow c_k + 1$. (Assume that $F(X_j) < 1$ so that $k < m$.) Then set $j \leftarrow 0$, $r^+ \leftarrow r^- \leftarrow 0$, and for $k = 0, 1, \ldots, m-1$ (in this order) do the following whenever $c_k > 0$: Set $r^- \leftarrow \max(r^-, a_k - j/n)$, $j \leftarrow j + c_k$, $r^+ \leftarrow \max(r^+, j/n - b_k)$. Finally set $K_n^+ \leftarrow \sqrt{n}\,r^+$, $K_n^- \leftarrow \sqrt{n}\,r^-$. The time required is $O(m + n)$, and the precise value of $n$ need not be known in advance. (If the estimate $(k + \frac12)/m$ is used for $a_k$ and $b_k$, so that only the values $c_k$ are actually computed for each $k$, we obtain estimates of $K_n^+$ and $K_n^-$ good to within $\frac12\sqrt{n}/m$, even when $m < n$.) [ACM Trans. Math. Software 3 (1977), 60-64.]
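The bucketing procedure of part (b) can be sketched in Python as follows. This is only an illustrative sketch: the function names `ks_bucketed` and `ks_sorted`, the guard that clamps the bucket index, and the sorting-based reference routine are assumptions introduced here for comparison, not part of the answer itself.

```python
import math

def ks_bucketed(observations, F, m):
    """Return (K_plus, K_minus) in one pass plus a scan of m buckets.

    Assumes m >= (number of observations) and 0 <= F(x) < 1.
    """
    a = [1.0] * m   # smallest F-value that fell into each bucket
    b = [0.0] * m   # largest F-value that fell into each bucket
    c = [0] * m     # how many observations fell into each bucket
    for x in observations:
        y = F(x)
        k = min(int(m * y), m - 1)   # clamp guards against rounding up to m
        a[k] = min(a[k], y)
        b[k] = max(b[k], y)
        c[k] += 1
    n = sum(c)      # n becomes known only after all observations are seen
    j = 0
    r_plus = r_minus = 0.0
    for k in range(m):
        if c[k] > 0:
            r_minus = max(r_minus, a[k] - j / n)
            j += c[k]
            r_plus = max(r_plus, j / n - b[k])
    return math.sqrt(n) * r_plus, math.sqrt(n) * r_minus

def ks_sorted(observations, F):
    """Reference computation by sorting (for comparison only)."""
    ys = sorted(F(x) for x in observations)
    n = len(ys)
    k_plus = max((j + 1) / n - y for j, y in enumerate(ys))
    k_minus = max(y - j / n for j, y in enumerate(ys))
    return math.sqrt(n) * k_plus, math.sqrt(n) * k_minus
```

With at least as many buckets as observations, the two routines should agree; the bucketed one avoids the sort at the cost of three arrays of size $m$.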


SECTION 3.3.2 

1. The observations for a chi-square test must be independent, and in the second 
sequence successive observations are manifestly dependent, since the second component 
of one equals the first component of the next. 

2. Form $t$-tuples $(Y_{jt}, \ldots, Y_{jt+t-1})$, for $0 \le j < n$, and count how many of these equal each possible value. Apply the chi-square test with $k = d^t$ and with probability $1/d^t$ in each category. The number of observations, $n$, should be at least $5d^t$.

3. The probability that $j$ values are examined, i.e., the probability that $U_{j-1}$ is the $n$th element of the sequence lying in the range $\alpha \le U_{j-1} < \beta$, is easily seen to be

$$\binom{j-1}{n-1}p^n(1-p)^{j-n},$$

by enumeration of the possible places in which the other $n - 1$ occurrences can appear and by evaluating the probability of such a pattern. The generating function is $G(z) = (pz/(1 - (1-p)z))^n$, which makes sense since the given distribution is the $n$-fold convolution of the same thing for $n = 1$. Hence the mean and variance are proportional to $n$; the number of $U$'s to be examined is now easily found to have the characteristics (min $n$, ave $n/p$, max $\infty$, dev $\sqrt{n(1-p)}/p$). A more detailed discussion of this probability distribution when $n = 1$ may be found in the answer to exercise 3.4.1-17; see also the considerably more general results of exercise 2.3.4.2-26.

4. The probability of a gap of length $\ge r$ is the probability that $r$ consecutive $U$'s lie outside the given range, i.e., $(1-p)^r$. The probability of a gap of length exactly $r$ is the above value for length $\ge r$ minus the value for length $\ge (r+1)$.

5. As N goes to infinity, so does n (with probability 1), hence this test is just the 
same as the gap test described in the text except for the length of the very last gap. 
And the text’s gap test certainly is asymptotic to the chi-square distribution stated, 
since the length of each gap is independent of the length of the others. [Notes: A quite complicated proof of this result by E. Bofinger and V. J. Bofinger appears in Annals Math. Stat. 32 (1961), 524-534. Their paper is noteworthy because it discusses several interesting variations of the gap test; they show, for example, that the quantity

$$\sum_{0\le r\le t}\frac{(Y_r - (Np)p_r)^2}{(Np)p_r}$$

does not approach a chi-square distribution, although others had suggested this statistic as a “stronger” test because $Np$ is the expected value of $n$.]

7. 5, 3, 5, 6, 5, 5, 4. 

8. See exercise 10, with $w = d$.

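9. (Change $d$ to $w$ in steps C1 and C4.) We have

$$p_r = \frac{d(d-1)\ldots(d-w+1)}{d^r}\genfrac{\{}{\}}{0pt}{}{r-1}{w-1}, \qquad\text{for } w \le r < t;$$

$$p_t = 1 - \frac{d!}{d^{t-1}}\Bigl(\frac{1}{0!}\genfrac{\{}{\}}{0pt}{}{t-1}{d} + \cdots + \frac{1}{(d-w)!}\genfrac{\{}{\}}{0pt}{}{t-1}{w}\Bigr).$$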
10. As in exercise 3, we really need consider only the case $n = 1$. The generating function for the probability that a coupon set has length $r$ is

$$G(z) = \frac{d!}{(d-w)!}\sum_{r>0}\genfrac{\{}{\}}{0pt}{}{r-1}{w-1}\Bigl(\frac{z}{d}\Bigr)^r = z^w\Bigl(\frac{d-1}{d-z}\Bigr)\ldots\Bigl(\frac{d-w+1}{d-(w-1)z}\Bigr)$$



by the previous exercise and Eq. 1.2.9-28. The mean and variance are readily computed using Theorem 1.2.10A and exercise 3.4.1-17. We find that

$$\mathrm{mean}(G) = w + \Bigl(\frac{d}{d-1} - 1\Bigr) + \cdots + \Bigl(\frac{d}{d-w+1} - 1\Bigr) = d(H_d - H_{d-w}) = \mu;$$

$$\mathrm{var}(G) = d^2\bigl(H_d^{(2)} - H_{d-w}^{(2)}\bigr) - d(H_d - H_{d-w}) = \sigma^2.$$

The number of $U$'s examined, as the search for a coupon set is repeated $n$ times, therefore has the characteristics (min $wn$, ave $\mu n$, max $\infty$, dev $\sigma\sqrt{n}$).

11. | 1 | 2 | 9 8 5 3 | 6 | 7 0 | 4 | . 

12. Algorithm R (Data for run test).

R1. [Initialize.] Set $j \leftarrow -1$, and set COUNT[1] ← COUNT[2] ← ⋯ ← COUNT[6] ← 0. Also set $U_n \leftarrow U_{n-1}$, for convenience in terminating the algorithm.

R2. [Set r zero.] Set $r \leftarrow 0$.

R3. [Is $U_j < U_{j+1}$?] Increase $r$ and $j$ by 1. If $U_j < U_{j+1}$, repeat this step.

R4. [Record the length.] If $r \ge 6$, increase COUNT[6] by one, otherwise increase COUNT[$r$] by one.

R5. [Done?] If $j < n - 1$, return to step R2. |
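Steps R1 through R5 translate almost line for line into Python. The following is a minimal sketch; the function name `run_length_counts`, the 0-based list, and the returned list layout (COUNT[1] through COUNT[6]) are choices made here for illustration.

```python
def run_length_counts(u):
    """Count ascending runs of lengths 1, 2, ..., 5 and >= 6 in a nonempty list."""
    n = len(u)
    u = list(u) + [u[-1]]        # R1: sentinel U_n = U_{n-1} stops the last run
    count = [0] * 7              # count[0] is unused
    j = -1
    while True:
        r = 0                    # R2
        while True:              # R3: extend the current run
            r += 1
            j += 1
            if not (u[j] < u[j + 1]):
                break
        count[min(r, 6)] += 1    # R4: lengths of 6 or more share one counter
        if not (j < n - 1):      # R5
            break
    return count[1:]
```

Applied to the digits 1 2 9 8 5 3 6 7 0 4, the runs are 1 2 9 | 8 | 5 | 3 6 7 | 0 4, so the counts for lengths 1, 2, and 3 are 2, 1, and 2.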

13. There are $(p+q+1)\binom{p+q}{p}$ ways to have $U_{i-1} \gtrless U_i < \cdots < U_{i+p-1} \gtrless U_{i+p} < \cdots < U_{i+p+q-1}$; subtract $\binom{p+q+1}{p+1}$ of these in which $U_{i-1} < U_i$, and subtract $\binom{p+q+1}{1}$ for those in which $U_{i+p-1} < U_{i+p}$; then add in 1 for the case that both $U_{i-1} < U_i$ and $U_{i+p-1} < U_{i+p}$, since this case has been subtracted out twice. (This is a special case of the “inclusion-exclusion” principle, which is explained further in Section 1.3.3.)

14. A run of length $r$ occurs with probability $1/r! - 1/(r+1)!$, assuming distinct $U$'s.

15. This is always true of $F(X)$ when $F$ is continuous and S has distribution $F$; see Section 3.3.1C.

16. (a) $Z_{jt} = \max(Z_{j(t-1)}, Z_{(j+1)(t-1)})$. If the $Z_{j(t-1)}$ are stored in memory, it is therefore a simple matter to transform this array into the set of $Z_{jt}$ with no auxiliary storage required. (b) With his “improvement,” each of the $V$'s should indeed have the stated distribution, but the observations are no longer independent. In fact, when $U_j$ is a relatively large value, all of $Z_{jt}$, $Z_{(j-1)t}$, ..., $Z_{(j-t+1)t}$ will be equal to $U_j$; so we almost have the effect of repeating the same data $t$ times (and that would multiply $V$ by $t$, cf. exercise 3.3.1-10).
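The in-place update of part (a) can be sketched as below. The sketch assumes that $Z_{jt}$ denotes $\max(U_j, \ldots, U_{j+t-1})$ and that the array initially holds $Z_{j1} = U_j$; the function name is hypothetical.

```python
def advance_window_maxima(z):
    """Turn the list of Z_{j(t-1)} into the list of Z_{jt}, in place."""
    for j in range(len(z) - 1):
        # z[j + 1] has not been overwritten yet, so it still holds Z_{(j+1)(t-1)}
        z[j] = max(z[j], z[j + 1])
    if z:
        z.pop()   # the last window would run past the end of the data
    return z
```

Scanning left to right is what makes auxiliary storage unnecessary: each entry is read in its old form before it is replaced.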

17. (b) By Lagrange’s identity, the difference is $\sum_{0\le k<j<n}(U'_kV'_j - U'_jV'_k)^2$, and this is certainly positive. (c) Therefore if $D^2 = N^2$, we must have $U'_kV'_j - U'_jV'_k = 0$, for all pairs $j, k$. This means that the matrix

$$\begin{pmatrix} U'_0 & U'_1 & \ldots & U'_{n-1}\\ V'_0 & V'_1 & \ldots & V'_{n-1}\end{pmatrix}$$

has rank $< 2$, so its rows are linearly dependent. (A more elementary proof can be given, using the fact that $U'_0V'_j - U'_jV'_0 = 0$ for $1 \le j < n$ implies the existence of constants $\alpha$, $\beta$ such that $\alpha U'_j + \beta V'_j = 0$ for all $j$, provided that $U'_0$ and $V'_0$ are not both zero; the latter case can be avoided by a suitable renumbering.)

18. (a) The numerator is $-(U_0 - U_1)^2$, the denominator is $(U_0 - U_1)^2$. (b) The numerator in this case is $-(U_0^2 + U_1^2 + U_2^2 - U_0U_1 - U_1U_2 - U_2U_0)$; the denominator is $2(U_0^2 + U_1^2 + U_2^2 - U_0U_1 - U_1U_2 - U_2U_0)$. (c) The denominator always equals $\sum_{0\le j<k<n}(U_j - U_k)^2$, by exercise 1.2.3-30 or 1.2.3-31.

21. The successive values of $c_{r-1} = s - 1$ in step P2 are 2, 3, 7, 6, 4, 2, 2, 1, 0; hence $f = 886862$.

22. $1024 = 6! + 2\cdot5! + 2\cdot4! + 2\cdot3! + 2\cdot2! + 0\cdot1!$, so we want the successive values of $s - 1$ in step P2 to be 0, 0, 0, 1, 2, 2, 2, 2, 0; working backwards, the permutation is (9, 6, 5, 2, 3, 4, 0, 1, 7, 8).
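The arithmetic in answers 21 and 22 can be cross-checked with a short sketch. It assumes the usual reading of the permutation-analysis algorithm: for $r = t, t-1, \ldots, 2$, find the position $s$ of the maximum among the first $r$ elements, set $f \leftarrow r\cdot f + s - 1$, and exchange that maximum with the element in position $r$. The function names and 0-based indexing are choices made here.

```python
def permutation_index(u):
    """Map a sequence of distinct values to an integer f, 0 <= f < t!."""
    u = list(u)
    f = 0
    for r in range(len(u), 1, -1):                 # r = t, t-1, ..., 2
        s = max(range(r), key=lambda i: u[i])      # 0-based, so s plays the role of s - 1
        f = r * f + s
        u[r - 1], u[s] = u[s], u[r - 1]            # move the maximum to position r
    return f

def index_from_digits(digits):
    """Fold the successive values of s - 1 (for r = t, ..., 2) into f."""
    t = len(digits) + 1
    f = 0
    for r, d in zip(range(t, 1, -1), digits):
        f = r * f + d
    return f
```

Under these assumptions, the digit sequence of answer 21 folds to 886862, and the permutation of answer 22 maps to 1024.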


SECTION 3.3.3 


1. $y((x/y)) + \frac12 y - \frac12 y\,\delta(x/y)$.

2. See exercises 1.2.4-38 and 1.2.4-39(a), (b), (g). 

3. $((x)) = \sum_{n\ge1}(-\sin 2\pi nx)/(n\pi)$, which converges for all $x$. (The representation in Eq. (24) may be considered a “finite” Fourier series, for the case when $x$ is rational.)

4. $d = 2^{10}\cdot5$. Note that we have $X_{n+1} < X_n$ with probability $\frac12 + \epsilon$, where

$$|\epsilon| < d/(2\cdot10^{10}) = 1/(2\cdot5^9);$$

hence every potency-10 generator is respectable from the standpoint of Theorem P.

5. An intermediate result:

$$\sum_{0\le x<m}\frac{x}{m}\,\frac{s(x)}{m} = \frac{1}{12}\sigma(a, m, c) + \frac{m}{4} - \frac{c}{2m} - \frac{x'}{2m}.$$


6. (a) Use induction and the formula 


hj + c 


hj + c — 1 
k 



A _ !/ 
k 2 \ k J 



hj + c — A 
_ k ) 


(b) Use the fact that 




7. Take $m = h$, $n = k$, $k = 2$ in the second formula of exercise 1.2.4-45:



— kh(h — 1). 


The sums on the left simplify, and by standard manipulations we get 


l2 , , , h , hr , k 1 

m k — hk ---— 

2 6/c 12 4 


-a(h, k, 0) — — cr(fc, h, 0) -f- — a(l, k, 0) = h 2 k — hk. 
6 6 12 


Since $\sigma(1, k, 0) = (k-1)(k-2)/k$, this reduces to the reciprocity law.

8. See Duke Math. J. 21 (1954), 391-397.

9. Begin with the handy identity $\sum_{0\le k<r}\lfloor kp/r\rfloor\lfloor kq/r\rfloor + \sum_{0\le k<p}\lfloor kq/p\rfloor\lfloor kr/p\rfloor + \sum_{0\le k<q}\lfloor kr/q\rfloor\lfloor kp/q\rfloor = (p-1)(q-1)(r-1)$ for which a simple geometric proof is possible. [U. Dieter, Abh. Math. Sem. Univ. Hamburg 21 (1957), 109-125.]
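The identity is easy to test numerically. The sketch below assumes that $p$, $q$, $r$ are pairwise relatively prime positive integers (the setting in which the three-term reciprocity law is normally stated); the function name is hypothetical.

```python
from math import gcd

def floor_product_sum(p, q, r):
    """Left-hand side of the identity: three sums of products of floors."""
    return (sum((k * p // r) * (k * q // r) for k in range(r))
            + sum((k * q // p) * (k * r // p) for k in range(p))
            + sum((k * r // q) * (k * p // q) for k in range(q)))

def pairwise_coprime(p, q, r):
    """True when no two of the three arguments share a common factor."""
    return gcd(p, q) == 1 and gcd(q, r) == 1 and gcd(p, r) == 1
```

For example, with $(p, q, r) = (2, 3, 5)$ the three sums are 3, 2, and 3, matching $(2-1)(3-1)(5-1) = 8$.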

10. Obviously $\sigma(k - h, k, c) = -\sigma(h, k, -c)$, cf. (8). Replace $j$ by $k - j$ in definition (16), to deduce that $\sigma(h, k, c) = \sigma(h, k, -c)$.


1L(a) £ (Is fc 


Q<j<dk 



hj + c 


sum on i. (b) 


hj-\-c + 9 
k 


= £ 


0 < i < d 
o<i<fc 


i k + j 
dk 



hj + c 
k 


use (10) to 


hj + c 



now sum. 


12. Since $\bigl(\bigl(\frac{hj+c}{k}\bigr)\bigr)$ runs through the same values as $\bigl(\bigl(\frac{j}{k}\bigr)\bigr)$ in some order, Cauchy’s inequality implies that $\sigma(h, k, c)^2 \le \sigma(h, k, 0)^2$; and $\sigma(1, k, 0)$ may be summed directly, cf. exercise 7.


13. cr(h, k, c ) -(- 


3(/c — 1) 
k 


12 

~ ~k 2s 


u 


—cj 


k (u>— h 3 — l)(ub — 1) 

0 <j<k 


+ j(c mod k) — 6( ( ^ 


k 


k 


if hh! = 1 (modulo k). 

14. $(2^{38} - 3\cdot2^{20} + 5)/(2^{70} - 1) \approx 2^{-32}$. An extremely satisfactory global value, in spite of the local nonrandomness!
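The exact value can be explored on scaled-down generators. The sketch below is built on two loudly stated assumptions: first, that the generator of this exercise has $m = 2^{35}$, $a = 2^{18} + 1$, $c = 1$ (this choice reproduces the numerator above, but it is not stated in the answer); second, that the family $m = 2L^2$, $a = 2L + 1$, $c = 1$ is a fair small analogue, with $L = 2^{17}$ giving the full-size case. The closed form $(16L^2 - 24L + 5)/(m^2 - 1)$ is an observation checked here only for small $L$.

```python
from fractions import Fraction

def full_period_serial_correlation(m, a, c):
    """Exact serial correlation of the pairs (x, (a*x + c) mod m), 0 <= x < m.

    Assumes the generator has full period m, so that every x occurs once.
    """
    sum_xs = sum(x * ((a * x + c) % m) for x in range(m))
    sum_x = m * (m - 1) // 2
    sum_x2 = (m - 1) * m * (2 * m - 1) // 6
    return Fraction(m * sum_xs - sum_x ** 2, m * sum_x2 - sum_x ** 2)

def scaled_analogue(L):
    """Hypothetical small analogue: m = 2*L*L, a = 2*L + 1, c = 1."""
    return full_period_serial_correlation(2 * L * L, 2 * L + 1, 1)

def conjectured_value(L):
    """Closed form observed for the scaled family (an assumption of this sketch)."""
    m = 2 * L * L
    return Fraction(16 * L * L - 24 * L + 5, m * m - 1)
```

Substituting $L = 2^{17}$ into the closed form gives $16L^2 = 2^{38}$ and $24L = 3\cdot2^{20}$, which is consistent with the numerator quoted in the answer.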


15. Replace $c^2$ where it appears in (19) by $\lfloor c\rfloor\lceil c\rceil$.


16. The hinted identity is equivalent to $m_1 = p_rm_{r+1} + p_{r-1}m_{r+2}$ for $1 \le r \le t$; this follows by induction, cf. also exercise 4.5.3-32. Now replace $c_j$ by $\sum_{j\le r\le t} b_rm_{r+1}$ and compare coefficients of $b_ib_j$ on both sides of the identity to be proved.

Note: For all exponents e > 1 we have 


£ (-F +1 

i<j<t 


rrijmj+i 


1 

mi 


1 <J<t 


(<S-Cj + 1 ) 

- Pi -1 

c 3 c i+1 


by a similar argument. 

17. During this algorithm we will have $k = m_j$, $h = m_{j+1}$, $c = c_j$, $p = p_{j-1}$, $p' = p_{j-2}$, $s = (-1)^{j+1}$ for $j = 1, 2, \ldots, t+1$.

D1. [Initialize.] Set $A \leftarrow 0$, $B \leftarrow h$, $p \leftarrow 1$, $p' \leftarrow 0$, $s \leftarrow 1$.

D2. [Divide.] Set $a \leftarrow \lfloor k/h\rfloor$, $b \leftarrow \lfloor c/h\rfloor$, $r \leftarrow c \bmod h$. (Now $a = a_j$, $b = b_j$, and $r = c_{j+1}$.)

D3. [Accumulate.] Set $A \leftarrow A + (a - 6b)s$, $B \leftarrow B + 6bp(c + r)s$. If $r \ne 0$ or $c = 0$, set $A \leftarrow A - 3s$. If $h = 1$, set $B \leftarrow B + ps$. (This subtracts $3e(m_{j+1}, c_j)$ and also takes care of the $\sum(-1)^{j+1}/m_jm_{j+1}$ terms.)

D4. [Prepare for next iteration.] Set $c \leftarrow r$, $s \leftarrow -s$; set $r \leftarrow k - ah$, $k \leftarrow h$, $h \leftarrow r$; set $r \leftarrow ap + p'$, $p' \leftarrow p$, $p \leftarrow r$. If $h > 0$, return to D2. |

At the conclusion of this algorithm, $p$ will be equal to the original value $k_0$ of $k$, so the desired answer will be $A + B/p$. The final value of $p'$ will be $h'$ if $s < 0$, otherwise $p'$ will be $k_0 - h'$. It would be possible to maintain $B$ in the range $0 \le B < k_0$, by making appropriate adjustments to $A$, thereby requiring only single-precision operations (with double-precision products and dividends) if $k_0$ is a single-precision number.
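A sketch of steps D1 through D4 in Python, compared against a direct evaluation of the sum, may help in checking the signs. It assumes the definition $\sigma(h, k, c) = 12\sum_{0\le j<k}((j/k))(((hj+c)/k))$, the conditions $0 < h < k$, $\gcd(h, k) = 1$, $0 \le c < k$, and the reading of D3 used in the code (adding $(a - 6b)s$ to $A$ and $6bp(c+r)s$ to $B$); the function names are hypothetical.

```python
from fractions import Fraction
from math import gcd

def sawtooth(x):
    """((x)): x - floor(x) - 1/2, or 0 when x is an integer (x is a Fraction)."""
    frac = x - (x.numerator // x.denominator)
    return Fraction(0) if frac == 0 else frac - Fraction(1, 2)

def dedekind_direct(h, k, c):
    """Generalized Dedekind sum evaluated straight from the assumed definition."""
    return 12 * sum(sawtooth(Fraction(j, k)) * sawtooth(Fraction(h * j + c, k))
                    for j in range(k))

def dedekind_algorithm_d(h, k, c):
    """Euclid-like evaluation following steps D1-D4."""
    A, B, p, p_prev, s = 0, h, 1, 0, 1          # D1
    while h > 0:
        a, b, r = k // h, c // h, c % h         # D2
        A += (a - 6 * b) * s                    # D3
        B += 6 * b * p * (c + r) * s
        if r != 0 or c == 0:
            A -= 3 * s
        if h == 1:
            B += p * s
        c, s = r, -s                            # D4
        k, h = h, k - a * h
        p, p_prev = a * p + p_prev, p
    return A + Fraction(B, p)                   # p now equals the original k
```

The loop performs one iteration per step of Euclid's algorithm on $(k, h)$, so it needs only about $\log k$ iterations where the direct sum needs $k$ terms.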

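18. A moment’s thought shows that the formula

$$S(h, k, c, z) = \sum_{0\le j<k}\bigl(\lfloor j/k\rfloor - \lfloor(j - z)/k\rfloor\bigr)\bigl(\bigl((hj + c)/k\bigr)\bigr)$$

is in fact valid for all $z$, not only when $k \ge z$. Writing $\lfloor j/k\rfloor - \lfloor(j - z)/k\rfloor = \frac{z}{k} + \bigl(\bigl(\frac{j-z}{k}\bigr)\bigr) - \bigl(\bigl(\frac{j}{k}\bigr)\bigr) + \frac12\delta\bigl(\frac{j}{k}\bigr) - \frac12\delta\bigl(\frac{j-z}{k}\bigr)$ and carrying out the sums yields $S(h, k, c, z) = zd((c/d))/k + \frac{1}{12}\sigma(h, k, hz + c) - \frac{1}{12}\sigma(h, k, c) + \frac12((c/k)) - \frac12(((hz + c)/k))$, where $d = \gcd(h, k)$. [This formula allows us to express the probability that $X_{n+1} < X_n < \alpha$ in terms of generalized Dedekind sums, given $\alpha$.]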

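19. The desired probability is

$$\frac1m\sum_{0\le x<m}\Bigl(\Bigl\lfloor\frac{x-\alpha}{m}\Bigr\rfloor - \Bigl\lfloor\frac{x-\beta}{m}\Bigr\rfloor\Bigr)\Bigl(\Bigl\lfloor\frac{s(x)-\alpha'}{m}\Bigr\rfloor - \Bigl\lfloor\frac{s(x)-\beta'}{m}\Bigr\rfloor\Bigr)$$

$$= \frac1m\sum_{0\le x<m}\Bigl(\frac{\beta-\alpha}{m} + \Bigl(\Bigl(\frac{x-\beta}{m}\Bigr)\Bigr) - \Bigl(\Bigl(\frac{x-\alpha}{m}\Bigr)\Bigr) + \frac12\delta\Bigl(\frac{x-\alpha}{m}\Bigr) - \frac12\delta\Bigl(\frac{x-\beta}{m}\Bigr)\Bigr)$$

$$\qquad\times\Bigl(\frac{\beta'-\alpha'}{m} + \Bigl(\Bigl(\frac{ax+c-\beta'}{m}\Bigr)\Bigr) - \Bigl(\Bigl(\frac{ax+c-\alpha'}{m}\Bigr)\Bigr) + \frac12\delta\Bigl(\frac{ax+c-\alpha'}{m}\Bigr) - \frac12\delta\Bigl(\frac{ax+c-\beta'}{m}\Bigr)\Bigr)$$

$$= \frac{\beta-\alpha}{m}\,\frac{\beta'-\alpha'}{m} + \frac{1}{12m}\bigl(\sigma(a, m, c + a\alpha - \alpha') - \sigma(a, m, c + a\alpha - \beta')$$

$$\qquad{} + \sigma(a, m, c + a\beta - \beta') - \sigma(a, m, c + a\beta - \alpha')\bigr) + \epsilon,$$

where $|\epsilon| \le 2.5/m$.

[This approach is due to U. Dieter. The discrepancy between the true probability and the ideal value $\frac{\beta-\alpha}{m}\,\frac{\beta'-\alpha'}{m}$ is bounded by $\sum_{1\le j\le t} a_j/4m$, according to Theorem K; conversely, by choosing $\alpha$, $\beta$, $\alpha'$, $\beta'$ appropriately we will obtain a discrepancy of at least half this bound when there are large partial quotients, by the fact that Theorem K is “best possible.” Note that when $a \approx \sqrt{m}$ the discrepancy cannot exceed $O(1/\sqrt{m})$, so even the locally nonrandom generator of exercise 14 will look good on the serial test over the full period; it appears that we should insist on an extremely small discrepancy.]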

20 - Eo<*<m \( x - s(x))/m]F(s(x) - 5(s(x)))/m]/m = Eo < x <m(( x - s{x))/m + 
(((6x+c)/m))+£)((s(x) — s(s+)))/m+((a(&x+c)/m))+£)/m; and x/m — ((x/m))+ 
h ~ h 6 ( x / m ), s(x)/m = (((ax + c)/m)) + £ — §<5 {{ax + c)/m), s(s(x))/m = (((a 2 x + 
ac + c)/m)) -j- \ ^ 6((a 2 x -\- ac -\- c)/m). Let s(x') = s(x") = 0 and d = gcd(6, m). 

The sum now reduces to 


- + — 
4 12m 


(5i—52 + 53— ^ 4 —[-55—56+57 — 5g + Sg ) + 


d 

2m 




+ 


1 


2m' 


x — x 


m 


ft 




where $S_1 = \sigma(a, m, c)$, $S_2 = \sigma(a^2, m, ac + c)$, $S_3 = \sigma(ab, m, ac)$, $S_4 = \sigma(1, m, 0) = (m-1)(m-2)/m$, $S_5 = \sigma(a, m, c)$, $S_6 = \sigma(b, m, c)$, $S_7 = -\sigma(a' - 1, m, a'c)$, and $S_8 = -\sigma(a'(a' - 1), m, (a')^2c)$, if $a'a \equiv 1$ (modulo $m$); and finally



where cq — cmodd. The grand total will be near ^ when d is small and when the 
fractions a/m, (fl 2 modm)/m, (abmodm)/m, b/m, (a' — 1 )/m, (a'(a' — 1)modm)/m, 
((ad) mod m)/m all have small partial quotients. (Note that a' — 1 = —b b 2 — • • •, 
cf. exercise 3.2.1.3-7.) 

21. $C = \bigl(s - (\frac12)^2\bigr)/\bigl(\frac13 - (\frac12)^2\bigr)$, where


-r 

= /‘ 


x{ax -j- 9}dx = 


1(1 

o 2 ^3 


2’ 2 I’ 


if X n — 


n — 0 


x{ax + 0} dx = s 0 4- si -|-1- s a _ 


... +/■ 

J —0 / a 


(ax 4- 0) dx 


JL _ 0_ a — 1 0 2 

3 a 2 a 4a 2a 


Therefore $C = (1 - 6\theta + 6\theta^2)/a$.

22. Let $[u, v)$ denote the set $\{x \mid u \le x < v\}$. We have $s(x) < x$ in the disjoint


intervals 


1—0 1—0 
a ’a — 1 


2 — 0 2 — 0 
a ’a — 1 




which have total length 


+ E (S)- E 

0<j<a —1 v ' 0<j<o V 


-1 + 


1±1 + 9 = 1 . 
2 2 


23. We have s(s(x)) < s(x) < x when x is in [ i ^, and ax 4" 0 — k is in 
£Ef), for 0 < j < k < a; or when x is in l) and ax 0 — a is either in 

[^, for 0 < j < [a0 J or in —- ,0 ). The desired probability is 




0 < j < k < a 


0 < j < [a0\ 


(a-1) 


1 , i 0,1/|a0|(|a0J 4-1 — 20) , , f ^ 

6 6a 2a~^~ ( ^=T)-+ nrnx(0, {a0} + 0 


which is $\frac16 + (1 - 3\theta + 3\theta^2)/6a + O(1/a^2)$ for large $a$. Note that $1 - 3\theta + 3\theta^2 \ge \frac14$, so $\theta$ can’t be chosen to make this probability come out right.

24. Proceed as in the previous exercise; the sum of the interval lengths is

$$\sum_{0<j_1\le\cdots\le j_{t-1}<a}\frac{j_1}{a^{t-1}(a-1)} = \frac{1}{a^{t-1}(a-1)}\binom{a+t-2}{t}.$$

To compute the average length, let $p_k$ be the probability of a run of length $\ge k$; the average is

$$\sum_{k\ge1}p_k = \sum_{k\ge1}\binom{a+k-2}{k}\frac{1}{a^{k-1}(a-1)} = \Bigl(\frac{a}{a-1}\Bigr)^{a} - \frac{a}{a-1}.$$

The value for a truly random sequence would be $e - 1$; and our value is $e - 1 + (e/2 - 1)/a + O(1/a^2)$. [Note: The same result holds for an ascending run, since we have $U_n > U_{n+1}$ if and only if $1 - U_n < 1 - U_{n+1}$. This would lead us to suspect that runs in linear congruential sequences might be slightly longer than normal, so the run test should be applied to such generators.]

25. $x$ must be in the interval $[(k + \alpha' - \theta)/a, (k + \beta' - \theta)/a)$ for some $k$, and also in the interval $[\alpha, \beta)$. Let $k_0 = \lceil a\alpha + \theta - \beta'\rceil$, $k_1 = \lceil a\beta + \theta - \beta'\rceil$. With due regard to boundary conditions, we get the probability

$$(k_1 - k_0)(\beta' - \alpha')/a + \max(0, \beta - (k_1 + \alpha' - \theta)/a) - \max(0, \alpha - (k_0 + \alpha' - \theta)/a).$$

This is $(\beta - \alpha)(\beta' - \alpha') + \epsilon$, where $|\epsilon| < 2(\beta' - \alpha')/a$.

26. See Fig. A-1; the orderings $U_1 < U_3 < U_2$ and $U_2 < U_3 < U_1$ are impossible; the other four each have probability $\frac14$.

27. $U_n = \{F_{n-1}U_0 + F_nU_1\}$. We need to have both $F_{k-1}U_0 + F_kU_1 < 1$ and $F_kU_0 + F_{k+1}U_1 > 1$. The half-unit-square in which $U_0 > U_1$ is broken up as shown in Fig. A-2, with various values of $k$ indicated. The probability for a run of length $k$ is $\frac12$, if $k = 1$; $1/F_{k-1}F_{k+1} - 1/F_kF_{k+2}$, if $k > 1$. The corresponding probabilities for a random sequence are $2k/(k+1)! - 2(k+1)/(k+2)!$; the following table compares the first few values.


fc: 

Probability in Fibonacci case: 
Probability in random case: 


1 

2 

3 

4 

5 

1 

1 

1 

1 

1 

5 

3 

TO 

or 

05 

1 

5 

11 

19 

29 

3 

T2 

00 

300 

3000 







28. Fig. A-3 shows the various regions in the general case. The “213” region means $U_2 < U_1 < U_3$, if $U_1$ and $U_2$ are chosen at random; the “321” region means that $U_3 < U_2 < U_1$, etc. The probabilities for 123 and 321 are $\frac14 - \alpha/2 + \alpha^2/2$; the probabilities for all other cases are $\frac18 + \alpha/4 - \alpha^2/4$. To have all equal to $\frac16$, we must have $1 - 6\alpha + 6\alpha^2 = 0$. [This exercise establishes a theorem due to J. N. Franklin, Math. Comp. 17 (1963), 28-59, Theorem 13; other results of Franklin’s paper are related to exercises 22 and 23.]


SECTION 3.3.4 

1. $\nu_1$ is always $m$ and $\mu_1 = 2$, for generators of maximum period.

2. Let $V$ be the matrix whose rows are $V_1, \ldots, V_t$. To minimize $Y\cdot Y$, subject to the condition that $Y \ne (0, \ldots, 0)$ and $VY$ is an integer column vector $X$, is equivalent to minimizing $(V^{-1}X)\cdot(V^{-1}X)$, subject to the condition that $X$ is a nonzero integer column vector. The columns of $V^{-1}$ are $U_1, \ldots, U_t$.

3. $a^2 \equiv 2a - 1$ and $a^3 \equiv 3a - 2$ (modulo $m$). By considering all short solutions of (15), we find that $\nu_3^2 = 6$ and $\nu_4^2 = 4$, for the respective vectors $(1, -2, 1)$ and $(1, -1, -1, 1)$, except in the following cases: $m = 2^eq$, $q$ odd, $e \ge 3$, $a \equiv 2^{e-1}$ (modulo $2^e$), $a \equiv 1$ (modulo $q$), $\nu_3^2 = \nu_4^2 = 2$; $m = 3^eq$, $\gcd(3, q) = 1$, $e \ge 2$, $a \equiv 1 \pm 3^{e-1}$ (modulo $3^e$), $a \equiv 1$ (modulo $q$), $\nu_4^2 = 2$; $m = 9$, $a = 4$ or 7, $\nu_2^2 = \nu_3^2 = 5$.

4. (a) The unique choice for $(x_1, x_2)$ is $\frac1m(y_1u_{22} - y_2u_{21}, -y_1u_{12} + y_2u_{11})$, and this is $\equiv \frac1m(y_1u_{22} + y_2au_{22}, -y_1u_{12} - y_2au_{12}) \equiv (0, 0)$ (modulo 1); i.e., $x_1$ and $x_2$ are integers. (b) When $(x_1, x_2) \ne (0, 0)$, we have $(x_1u_{11} + x_2u_{21})^2 + (x_1u_{12} + x_2u_{22})^2 = x_1^2(u_{11}^2 + u_{12}^2) + x_2^2(u_{21}^2 + u_{22}^2) + 2x_1x_2(u_{11}u_{21} + u_{12}u_{22})$, and by hypothesis this is $\ge (x_1^2 + x_2^2 - |x_1x_2|)(u_{11}^2 + u_{12}^2) \ge u_{11}^2 + u_{12}^2$. [Note that this is a stronger result than Lemma A, which tells us only that $x_1^2 \le (u_{11}^2 + u_{12}^2)(u_{21}^2 + u_{22}^2)/m^2$ and that $x_2^2 \le (u_{11}^2 + u_{12}^2)^2/m^2$, where the latter can be $\ge 1$. The idea is essentially Gauss’s notion of a reduced binary quadratic form, Disq. Arith. (Leipzig: 1801), §171.]

5. Conditions (30) remain invariant; hence h cannot be zero in step S 2 , when a is 
relatively prime to m. Since h always decreases in that step, S 2 eventually terminates 
with u 2 -f- v 2 > s. Note that pp' < 0 throughout the calculation. 

The hinted inequality surely holds the first time step S2 is performed. The integer 
q' that minimizes (h 1 — qh) 2 + (p' — q'p) 2 is q' = round((/i'/i + p'p)/(h 2 -(- P 2 ))> by 
(24). If (h' — q'h) 2 -(- (p' — q'p) 2 < h 2 + p 2 we must have q' 7 ^ 0, q' 7 ^ —1, hence 
( p' — q'p) 2 > p 2 , hence (h' — q'h,) 2 < h 2 , i.e., | h! — q'h\ < h, i.e., q’ is q or q 1 . 
We have hu + pv > h(h' — q'h) -f- p(p' — q'p) > —^( h 2 -f- p 2 ), so if u 2 + v 2 <5 
the next iteration of step S2 will preserve the assumption in the hint. u 2 v 2 > 
s > (u — h) 2 + (^ — p) 2 , we have 2j h(u — h)-\- p(v — p)\ = 2 (h(h — u) + p(p — v)) = 
(u — h ) 2 -f- (v — p) 2 -\~ h 2 -\~P 2 — (u 2 + v 2 ) < (u — h) 2 -(- (u — p) 2 <h 2 -\- p 2 , hence 
(u — h) 2 -\-(v — p) 2 is minimal by exercise 4. Finally if both u 2 -\-v 2 and ( u — h) 2 -\-(v — p) 2 
are > s, let u' — h' — q'h, v' — p' — q'p; then 2| hu' + pv’\ < h 2 + p 2 < u' 2 + v' 2 , 
and h 2 -j- p 2 is minimal by exercise 4. 

6. If u 2 -\-v 2 > 5 > ( u — h) 2 -\-(v — p) 2 in the previous answer, we have (u— p) 2 > v 2 , 
hence (u — h) 2 < u 2 ; and if q — aj, so that h' = a 3 h u, we must have %_(_ 1 = 1. 
It follows that u 2 — mino <j<t(m 2 -\-p 2 _ x ), in the notation of exercise 3.3.3-16. 

Now we have mo = m 3 pj + m j+iPj —1 = a j m jPj —1 4~ m jPj —2 + mj-\-iPj—i < 
(dj 4- 1 + l/dj)mjPj—i < (A-\- 1 4" A)m 3 pj— 1 , and m 2 + p 2 _ x > 2mjPj—i, hence 
the result. 

7. We shall prove, using condition (19), that $U_j\cdot U_k = 0$ for all $k \ne j$ iff $V_j\cdot V_k = 0$ for all $k \ne j$. Assume that $U_j\cdot U_k = 0$ for all $k \ne j$, and let $U_j = \alpha_1V_1 + \cdots + \alpha_tV_t$. Then $U_j\cdot U_k = \alpha_k$ for all $k$, hence $U_j = \alpha_jV_j$, and $V_j\cdot V_k = \alpha_j^{-1}(U_j\cdot V_k) = 0$ for all $k \ne j$. A symmetric argument proves the converse.

8. Clearly $\nu_{t+1} \le \nu_t$ (a fact used implicitly in Algorithm S, since $s$ is not changed when $t$ increases). For $t = 2$ this is equivalent to $(m\mu_2/\pi)^{1/2} \ge (\frac34m\mu_3/\pi)^{1/3}$, i.e., $\mu_3 \le \frac43\sqrt{m/\pi}\,\mu_2^{3/2}$. This reduces to $\frac43 10^{-4}/\sqrt{\pi}$ with the given parameters, but for large $m$ and fixed $\mu_2$ the bound (39) is better.

9. Let $f(y_1, \ldots, y_t) = \theta$; then $\gcd(y_1, \ldots, y_t) = 1$, so there is an integer matrix $W$ of determinant 1 having $(y_1, \ldots, y_t)$ as its first row. (Prove the latter fact by induction on the magnitude of the smallest nonzero entry in the row.) Now if $X = (x_1, \ldots, x_t)$ is a row vector, we have $XW = X'$ iff $X = X'W^{-1}$, and $W^{-1}$ is an integer matrix of determinant 1, hence the form $g$ defined by $WU$ satisfies $g(x_1, \ldots, x_t) = f(x'_1, \ldots, x'_t)$; furthermore $g(1, 0, \ldots, 0) = \theta$.

Without loss of generality, assume that $f = g$. If now $S$ is any orthogonal matrix,
the matrix US defines the same form as U, since (XUS)(XUS) T = (XU)(XU ) T . 
Choosing S so that its first column is a multiple of V\ and its other columns are any 
suitable vectors, we have 



for some ot\, ct 2 , ..., a t and some (t — 1 ) X (t — 1) matrix U' . Hence f(x i,..., x t ) = 
(c*iXi ottXt) 2 + h(x 2 ,..., xt). It follows that ai — \f9 [in fact, ctj — (U\ • 

Uj)fVe for 1 < j < t], and that h is a positive definite quadratic form defined by 

U', where det U' = (det U)/V$. By induction on t, there are integers (x 2 , • ■ ■, Xt) with 
h(x 2 ,-",Xt) < 2 )/ 2 | det 1 )/^ 1 /( t 1 ) j an( j f or these integer values we can 

choose xi so that \xi + ( 0 : 2 X 2 + • • • + a t x t )/ oi| < i.e., (oiXx + • • • -j- a t x t ) 2 < 

\9. Hence 6 < f(x 1 ,..., x t ) < \9 + 2 ^ /2 | detC/| 2//ft 1 )/^ 1 /C t and the desired 

inequality follows immediately. 

[Note: For t — 2 the result is best possible. For general t, Hermite’s theorem 
implies that p t < 7 r t ' / 2 ( 4 / 3 ) t ^ t “‘ 1 ^ 4 /(t/ 2 )!. A fundamental theorem due to Minkowski 
(“Every t-dimensional convex set symmetric about the origin with volume > 2 t contains 
a nonzero integer point”) gives fit < 2*; this is stronger than Hermite’s theorem for 
t > 9. Even stronger results are known, cf. (41).] 

10. Since $y_1$ and $y_2$ are relatively prime, we can solve $u_1y_2 - u_2y_1 = m$; furthermore $(u_1 + qy_1)y_2 - (u_2 + qy_2)y_1 = m$ for all $q$, so we can ensure that $2|u_1y_1 + u_2y_2| \le y_1^2 + y_2^2$ by choosing an appropriate integer $q$. Now $y_2(u_1 + au_2) \equiv y_2u_1 - y_1u_2 \equiv 0$ (modulo $m$), and $y_2$ must be relatively prime to $m$, hence $u_1 + au_2 \equiv 0$. Finally let $|u_1y_1 + u_2y_2| = \alpha m$, $u_1^2 + u_2^2 = \beta m$, $y_1^2 + y_2^2 = \gamma m$; we have $0 \le \alpha \le \frac12\gamma$, and it remains to be shown that $\alpha \le \frac12\beta$ and $\beta\gamma \ge 1$. The identity $(u_1y_2 - u_2y_1)^2 + (u_1y_1 + u_2y_2)^2 = (u_1^2 + u_2^2)(y_1^2 + y_2^2)$ implies that $1 + \alpha^2 = \beta\gamma$. If $\alpha > \frac12\beta$ we have $2\alpha\gamma > 1 + \alpha^2$, i.e., $\gamma - \sqrt{\gamma^2 - 1} < \alpha \le \frac12\gamma$. But $\frac12\gamma < \sqrt{\gamma^2 - 1}$ implies that $\gamma^2 > \frac43$, a contradiction.

11. Since $a$ is odd, $y_1 + y_2$ must be even. To avoid solutions with $y_1$ and $y_2$ both even, let $y_1 = x_1 + x_2$, $y_2 = x_1 - x_2$, and solve $x_1^2 + x_2^2 = m/\sqrt3 - \epsilon$, with $(x_1, x_2)$ relatively prime and $x_1$ even; the corresponding multiplier $a$ will be the solution to $(x_2 - x_1)a \equiv x_2 + x_1$ (modulo $2^e$). It is not difficult to prove that $a \equiv 1$ (modulo $2^{k+1}$) iff $x_1 \equiv 0$ (modulo $2^k$), so we get the best potency when $x_1 \bmod 4 = 2$. The problem reduces to finding relatively prime solutions to $x_1^2 + x_2^2 = N$ where $N$ is a large integer of the form $4k + 1$. By factoring $N$ over the Gaussian integers, we can see that solutions exist if and only if each prime factor of $N$ (over the usual integers) has the form $4k + 1$.

According to a famous theorem of Fermat, every prime $p$ of the form $4k + 1$ can be written $p = u^2 + v^2 = (u + iv)(u - iv)$, $v$ even, in a unique way except for the signs of $u$ and $v$. The numbers $u$ and $v$ can be calculated efficiently by solving $x^2 \equiv -1$ (modulo $p$), then calculating $u + iv = \gcd(x + i, p)$ by Euclid’s algorithm over the Gaussian integers. [We can take $x = n^{(p-1)/4} \bmod p$ for almost half of all integers $n$. This application of a Euclidean algorithm is essentially the same as finding the least nonzero $u^2 + v^2$ such that $u \pm xv \equiv 0$ (modulo $p$).] If the prime factorization of $N$ is $p_1^{e_1}\ldots p_r^{e_r} = (u_1 + iv_1)^{e_1}(u_1 - iv_1)^{e_1}\ldots(u_r + iv_r)^{e_r}(u_r - iv_r)^{e_r}$, we get $2^{r-1}$ distinct solutions to $x_1^2 + x_2^2 = N$, $\gcd(x_1, x_2) = 1$, $x_1$ even, by letting $|x_2| + i|x_1| = (u_1 + iv_1)^{e_1}(u_2 \pm iv_2)^{e_2}\ldots(u_r \pm iv_r)^{e_r}$; and all such solutions are obtained in this way.

Note: When $m = 10^e$, a similar procedure can be used, but it is five times as much work since we must keep trying until finding a solution with $x_1 \equiv 0$ (modulo 10). For example, when $m = 10^{10}$ we have $\lfloor m/\sqrt3\rfloor = 5773502691$, and $5773502689 = 53\cdot108934013 = (7 + 2i)(7 - 2i)(2203 + 10202i)(2203 - 10202i)$. Of the two solutions $|x_2| + i|x_1| = (7 + 2i)(2203 - 10202i)$ or $(7 + 2i)(2203 + 10202i)$, the former gives $|x_1| = 67008$ (no good) and the latter gives $|x_1| = 75820$, $|x_2| = 4983$ (which is usable). Line 9 of Table 1 was obtained by taking $x_1 = 75820$, $x_2 = -4983$.

Line 20 of the table was obtained as follows: $\lfloor2^{35}/\sqrt3\rfloor = 19837604196$; we drop down to 19837604193, which is divisible by 3 so it is ineligible. Similarly, 19837604189 is divisible by 19, and 19837604185 by 7, and 19837604181 by 3; but 19837604177 is prime and equals $131884^2 + 49439^2$. The corresponding multiplier is 1175245817; a better one could be found if we continued searching. The multiplier on line 24 is the best of the first sixteen multipliers found by this procedure when $m = 2^{32}$.
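The Gaussian-integer arithmetic in the $m = 10^{10}$ example is easy to check by machine. The sketch below represents a Gaussian integer as a (real, imaginary) pair; this representation and the function names are choices made here, and the real and imaginary parts are taken in absolute value to obtain $|x_2|$ and $|x_1|$.

```python
def gauss_mul(z, w):
    """Product of two Gaussian integers given as (real, imag) pairs."""
    (a, b), (c, d) = z, w
    return (a * c - b * d, a * d + b * c)

def norm(z):
    """Norm u^2 + v^2 of the Gaussian integer u + iv."""
    return z[0] ** 2 + z[1] ** 2

def candidate(z, w):
    """Return (|x2|, |x1|) for the product z*w, read as |x2| + i|x1|."""
    re, im = gauss_mul(z, w)
    return abs(re), abs(im)
```

Since the norm is multiplicative, either choice of sign gives $x_1^2 + x_2^2 = 53\cdot108934013$; only the choice whose $|x_1|$ is a multiple of 10 is usable for $m = 10^{10}$.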

12. $U'_j\cdot U'_j = U_j\cdot U_j + 2\sum_{i\ne j}q_i(U_i\cdot U_j) + \sum_{i\ne j}\sum_{k\ne j}q_iq_k(U_i\cdot U_k)$. The partial derivative with respect to $q_k$ is twice the left-hand side of (26). If the minimum can be achieved, these partial derivatives must all vanish.

13. $u_{11} = 1$, $u_{21} = {}$irrational, $u_{12} = u_{22} = 0$.

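14. After three Euclidean steps we find $\nu_2^2 = 5^2 + 5^2$, then S4 produces

$$U = \begin{pmatrix} -5 & 5 & 0\\ -18 & -2 & 0\\ 1 & -2 & 1\end{pmatrix},\qquad V = \begin{pmatrix} -2 & 18 & 38\\ -5 & -5 & -5\\ 0 & 0 & 100\end{pmatrix}.$$

Transformations $(j, q_1, q_2, q_3) = (1, *, 0, 2)$, $(2, -4, *, 1)$, $(3, 0, 0, *)$, $(1, *, 0, 0)$ result in

$$U = \begin{pmatrix} -3 & 1 & 2\\ -5 & -8 & -7\\ 1 & -2 & 1\end{pmatrix},\qquad V = \begin{pmatrix} -22 & -2 & 18\\ -5 & -5 & -5\\ 9 & -31 & 29\end{pmatrix},\qquad Z = (0\ \ 0\ \ 1).$$

Thus $\nu_3 = \sqrt6$, as we already knew from exercise 3.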
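The matrices in this answer can be sanity-checked under the assumption that the rows of $U$ and $V$ are dual to each other in the sense $U_i\cdot V_j = m\,\delta_{ij}$ with $m = 100$ (an assumption suggested by the row $(0, 0, 100)$). The signs in the second rows of the final matrices are taken here as $(-5, -8, -7)$ and $(-5, -5, -5)$, the choice that satisfies this relation; the names in the sketch are hypothetical.

```python
def dual_product(U, V):
    """Matrix whose (i, j) entry is the dot product of row U_i with row V_j."""
    return [[sum(x * y for x, y in zip(u, v)) for v in V] for u in U]

def scaled_identity(m, t):
    """The t-by-t matrix m*I."""
    return [[m if i == j else 0 for j in range(t)] for i in range(t)]

U_START = [[-5, 5, 0], [-18, -2, 0], [1, -2, 1]]
V_START = [[-2, 18, 38], [-5, -5, -5], [0, 0, 100]]
U_FINAL = [[-3, 1, 2], [-5, -8, -7], [1, -2, 1]]
V_FINAL = [[-22, -2, 18], [-5, -5, -5], [9, -31, 29]]

# Squared length of the shortest row of the final U.
SHORTEST_SQUARED = min(sum(x * x for x in row) for row in U_FINAL)
```

The shortest row of the final $U$ is $(1, -2, 1)$, whose squared length is 6.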

15. The largest achievable $q$ in (11), minus the smallest achievable, plus 1, is $|u_1| + \cdots + |u_t| - \delta$, where $\delta = 1$ if $u_iu_j < 0$ for some $i$ and $j$, otherwise $\delta = 0$. For example if $t = 5$, $u_1 > 0$, $u_2 > 0$, $u_3 > 0$, $u_4 = 0$, and $u_5 < 0$, the largest achievable value is $q = u_1 + u_2 + u_3 - 1$ and the smallest is $q = u_5 + 1 = -|u_5| + 1$.

[Note that the number of hyperplanes is unchanged when $c$ varies, hence the same answer applies to the problem of covering $L$ instead of $L_0$. However, the stated formula is not always exact for covering $L_0$, since the hyperplanes that intersect the unit hypercube may not all contain points of $L_0$. In the example above, we can never achieve the value $q = u_1 + u_2 + u_3 - 1$ in $L_0$ if $u_1 + u_2 + u_3 > m$; it is achievable iff there is a solution to $m - u_1 - u_2 - u_3 = x_1u_1 + x_2u_2 + x_3u_3 + x_4|u_5|$ in nonnegative integers $(x_1, x_2, x_3, x_4)$. It may be true that the stated limits are always achievable when $|u_1| + \cdots + |u_t|$ is minimal, but this does not appear to be obvious.]

16. It suffices to determine all solutions to (15) having minimum $|u_1| + \cdots + |u_t|$, subtracting 1 if any one of these solutions has components of opposite sign.

Instead of positive definite quadratic forms, we work with the somewhat similar function $f(x_1, \ldots, x_t) = |x_1U_1 + \cdots + x_tU_t|$, defining $|Y| = |y_1| + \cdots + |y_t|$. Inequality (21) can be replaced by $|x_k| \le (\max_{1\le j\le t}|v_{kj}|)f(y_1, \ldots, y_t)$.

Thus a workable algorithm can be obtained as follows. Replace steps S1 through S3 by: “Set $U \leftarrow (m)$, $V \leftarrow (1)$, $r \leftarrow 1$, $s \leftarrow m$, $t \leftarrow 1$.” (Here $U$ and $V$ are $1\times1$ matrices; thus the two-dimensional case will be handled by the general method. A special procedure for $t = 2$ could, of course, be devised.) In steps S4 and S8, set $s \leftarrow \min(s, |U_k|)$. In step S8, set $z_k \leftarrow \lfloor\max_{1\le j\le t}|v_{kj}|\,s/m\rfloor$. In step S10, set $s \leftarrow \min(s, |Y| - \delta)$; and in step S11, output $s = N_t$. Otherwise leave the algorithm as it stands, since it already produces suitably short vectors. [Math. Comp. 29 (1975), 827-833.]

17. When $k > t$ in S10, and if $Y\cdot Y \le s$, output $Y$ and $-Y$; furthermore if $Y\cdot Y < s$, take back the previous output of vectors for this $t$. [In the author’s experience preparing Table 1, there was exactly one vector (and its negative) output for each $\nu_t$, except when $y_1 = 0$ or $y_t = 0$.]

18. (a) Let x = m, y = (1 — m)/3, Vi 3 = y -\- x 8 i3 , Ui 3 = —y -f- 6 {j. Then V 3 -14 — 
|(m 2 — 1) for j 4 fc, V k • Vfc = §(m 2 -f £), U 3 • U 3 = \{m 2 -f 2), z k « >/J m. (This 
example satisfies (28) with a = 1 and works for all m = 1 (modulo 3).) 

(b) Interchange the roles of U and V in step S5. Also set s <— min(s, U% ■ Ui ) for all 
Ui that change. For example, when m — 64 this transformation with j = 1, applied 
to the matrices of (a), reduces 


to 



—21 

—21 

43 

—21 

—21 

43 


1 ^ 

43 —21 , 

—21 43/ 


/22 21 21 \ 
U = l 21 22 21 

V21 21 22/ 

/ 22 21 21 \ 
U = ( -1 l 0 . 

V-i 0 1 / 


[Since the transformation can increase the length of V 3 , an algorithm that incorporates 
both transformations must be careful to avoid infinite looping. See also exercise 23.] 


19. No, since a product of non-identity matrices with all off-diagonal elements nonnegative and all diagonal elements 1 cannot be the identity.

[However, looping would be possible if a subsequent transformation with $q = -1$ were performed when $-2V_i\cdot V_j = V_j\cdot V_j$; the rounding rule must be asymmetric with respect to sign if non-shortening transformations are allowed.]

20. Use the ordinary spectral test for $a$ and $m = 2^{e-2}$; cf. exercise 3.2.1.2-9. [On intuitive grounds, the same answer should apply also when $a \bmod 8 = 3$.]

21. Xin+4 = Xtn (modulo 4), so it is now appropriate to let Vi = (4, 4a 2 ,4a 3 )/m, 
V 2 = ( 0 , 1 , 0 ), V 3 = ( 0 , 0 , 1 ) define the corresponding lattice Lq. 


24. Let m — p) an analysis paralleling the text can be given. For example, when 
t = 4 we have X n +3 — ((a 2 + 6)X n+1 -f- a&Xn) mod n, and we want to minimize 
u i + u 2 + u i + u \ 7 ^ 0 suc h u i + bu 3 abui = U 2 -\- auz -[- (a 2 4- b)ii4 = 0 
(modulo m). 

Replace steps SI through S3 by the operations of setting 


U 





t <- 


2 , 


and outputting U 2 = m. Replace step S4 by

S4'. [Advance t.] If t = T, the algorithm terminates. Otherwise set t <— i-j-1 and R <— 
R( i a)modm. Set Ut to the new row (—r i 2 , —r 2 2 , 0, ... ,0,1) of £ elements, and 
set uu <— 0 for 1 < i < t. Set Vt to the new row (0,..., 0, m). For 1 < i < t, set 
q «- round((viiri 2 +^i 2 r 2 2 )/m), «- v»iri 2 + v i2 r 2 2 — qm, and Ut <- U t + qU t . 

Finally set s <— min(s, U t • U t ), k <— t, j <— 1. 

[A similar generalization applies to all sequences of length p k — 1 defined by Eq. 
3.2.2-8. Additional numerical examples have been given by A. Grube, Zeitschrift fiir 
angewandte Math, und Mechanik 53 (1973), T223-T225.] 

25. The given sum is at most two times the quantity Ylo<k<m/ 2 d r (^) = 1 + 
J/(m/d), where 


f(m) = — 
m 



csc(irk/m ) 


1 < fc < m/2 


1 

m 



esc {'Kx/m) dx -f O 




[When d = 1, we have X/o<fc<m r (^) = (2/7r) In m 1 -f- (2 /tt) ln(2e/7r) -f- 0(l/m).] 

26. When m = 1, we cannot use (52) since k will be zero. If gcd(q, m) = d, the same 

derivation goes through with m replaced by m/d. Suppose we have m = pi 1 ...p* r 
and gcd(a — 1, m) = pf 1 ... p( r and d = pf 1 ... pf r . If m is replaced by m/d, then s is 
replaced by .. p max(o,e r -/ r -d r)> 

27. It is convenient to use the following functions: p(x) = 1 if x = 0, p(x) = x if 
0 < x < m/2, p(x) — m — x if m/2 < x < m; trunc(z) — \x/^\ if 0 < x < m/2, 
trunc(x) = m — |_( m — x )/^\ if m/2 < x < m; L(x) = 0 if x — 0, L(x) = [lg xj —{— 1 if 
0 < x < m/2, L(x) = —(|_lg(m—x)J —(~l) if m/2 < x < m; and !(x) — max(l, 2^l —1 ). 
Note that ^(L(x)) < p(x) < 2 l(L(x)) and p(x) < msin(7rx/m) < 7rp(x) for 0 < x < m. 

Say that a vector (ui ,..., ut) is bad if it is nonzero and satisfies (15); and let p ma x 
be the maximum value of p{u\) ... p(u t ) over all bad (ui, ..., ut). The vector (ui,..., u t ) 
is said to be in class (L(ui),... ,L(ut)). Thus there are at most (21gm-}-l) t classes, and 
class (Li,... ,L t ) contains at most 1{L \).. ,l(L t ) vectors. Our proof is based on showing 
that the bad vectors in each fixed class contribute only 0(l/p ma x) to r i u i>' ■ • > r t); 
this proves even more than was asked, since 1/pmax < 7rV ma x- 

Let p = [lgpmaxj- The p-fold truncation operator on a vector is defined to be 
the following operation repeated p times: “Let j be minimal such that p(uj ) > 1, 
and replace Uj by trunc(uj); but do nothing if p{u 3 ) = 1 for all j.” (This operation 
essentially throws away one bit of information about (ui ,..., u t )•) If (u [,..., u[) and 
{u’l ,..., u") are two vectors of the same class having the same /z-fold truncation, we say 
they are similar ; in this case it follows that p(u[ — u ")... p{u' t — u'/) < 2 M < p ma x- For 
example, any two vectors of the form ((lx 2 xi) 2 , 0, m — ( 123 ) 2 , ( 101 x 5 X 4 ) 2 , (1101) 2 ) are 
similar when m is large and p — 5; the pi-fold truncation operator successively removes 
Xi, x 2 , 23 , X 4 , X 5 . Since the difference of two bad vectors satisfies (15), it is impossible 
for two unequal bad vectors to be similar. Therefore class (Li,...,L t ) can contain 
at most max(l,/(Li).. J(L t )/2 M ) bad vectors. If class (Li,...,L t ) contains exactly 
one bad vector (tti,..., itt), we have r(u\,... ,u t ) < r max < 1/pmax; if it contains 
< l(Li )... Z(L t )/2 M bad vectors, each of them has r(ui,... ,u t ) < l/p(wi)... p(u t ) < 
1 
28. Let <; = e 27ri /( m ~ 1 ) and i et S ki = Eo< J <m-i w* i+ V fc - The analog of (51) is

ISjtol = \/ra, hence the analog of (53) is |jV _ 1 Eo<n</v a,Xn I = 0({y/m\ogm)/N). 
The analogous theorem now states that 

d M = Q ( ^ (1 ° r ) + j + o((logm)‘ r ma x), £>£_, = 0((logro)* w). 


In fact, E r(ui,..., Ut) [summed over nonzero solutions of (15)] -f- 

r ( w i» • • • > u t) [summed over all nonzero (iti,..., u t )]. The latter sum is O(logra)* 
by exercise 25 with d — 1, and the former sum is treated as in exercise 27. 

Let us now consider the quantity R(a) = E r ( u i> • • • ,ut) summed over nonzero 
solutions of (15). Since m is prime, each {u \,... ,u t ) can be a solution to (15) for at 
most t — 1 values of a, hence Eo<a<m ^( a ) ^ (£ — 1) E r ( w c •.. ,t*t) = O^logra) 4 ). 
It follows that the average value of R(a ) taken over all <p(ra — 1) primitive roots is 
0 (t(\ogmy / (p(m — 1)). 

Note: In general l/(p(n) = O(loglogn/n); we have therefore proved that for all 
prime m and for all T there exists a primitive root a modulo m such that the linear 
congruential sequence (1, a, 0, m) has discrepancy = 0(m'~ 1 T(logm) T log log ra) 

for 1 < t < T. This method of proof does not extend to a similar result for linear con¬ 
gruential generators of period 2 e modulo 2 e , since for example the vector (1, —3,3, —1) 
solves (15) for about 2 2e//3 values of a. 

29. u\ + * * * + u\ > is 2 > 2(t — 1) implies that p(ui )... p(u t ) > uj — (t — 1) > 
Vt! \/2, in the notation of exercise 27. 

30. We wish to minimize q\aq — mp\ for 1 < q < m and 0 < p < a. In the notation of 
exercise 4.5.3-42, we have aq n — mp n = (— l) n Q 9 -n-i{a n + 2 , ■ ■ ■, a s ) for 0 < n < s. 
In the range q n —\ < q < q n we have \aq — mp\ > |aq n —i — mp n — 1 |; consequently 
q\aq — mp\ > q n -i\aq n —i— mp n — 1 |, and the minimum is mino<n< s q n \aq n —mp n \ = 
mino< n <s Qn{ai, • • •, o, n )Qs—n—i{a n + 2 , ■ ■ •, a s ). By exercise 4.5.3-32 we have m — 

Qn{(li, . . . , fln)fln+lQs—n—1 (u n +2j ■ ■ • , &s) 1, ■ • ■ , CLn)Qs — n — 2{(Tn+3i • * ■ j fla) “1“ 

Q n —i(ai ,..., a n — i)Qs—n— ua n -h 2 , • • ■, a s ); and our problem is essentially that of max¬ 
imizing the quantity m/Q n {cL \,..., a n )Qs— n — i(n + 2,..., a s ), which lies between a n +i 
and a n -(-i + 2. 


31. Equivalently, the conjecture is that all large $m$ can be written $m = Q_n(a_1, \ldots, a_n)$ for some $n$ and some $a_i \in \{1, 2, 3\}$. For fixed $n$ the $3^n$ numbers $Q_n(a_1, \ldots, a_n)$ have an average value of order $(1 + \sqrt2)^n$, and their standard deviation is of order $(2.51527)^n$; so the conjecture is almost surely true. S. K. Zaremba conjectured in 1972 that all $m$ can be represented with $a_i \le 5$; T. W. Cusick made some progress on this problem in Mathematika 24 (1977), 166-172. It appears that only the cases $m = 54$ and $m = 150$ require $a_i = 5$, and the largest $m$'s that require 4's are 2052, 2370, 5052, and 6234; at least, the author has found representations with $a_i \le 3$ for all other integers less than 2000000. When we require $a_i \le 2$, the average of $Q_n(a_1, \ldots, a_n)$ is $\frac45 2^n + \frac15 (-2)^{-n}$, while the standard deviation grows as $(2.04033)^n$. The density of such numbers in the author's experiments (which considered $2^6$ blocks of $2^{14}$ numbers each, for $m \le 2^{20}$) appears to vary between .50 and .65.

[See I. Borosh and H. Niederreiter, to appear, for a computational method that finds multipliers with small partial quotients. They have found 2-bounded solutions with $m = 2^e$ for $25 \le e \le 35$.]
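
The average quoted in answer 31 for partial quotients restricted to {1, 2} can be checked by brute force. The following Python sketch is only an illustration: the function names are hypothetical, and it assumes the usual continuant recurrence $Q_n(a_1,\ldots,a_n) = a_n Q_{n-1}(a_1,\ldots,a_{n-1}) + Q_{n-2}(a_1,\ldots,a_{n-2})$ with $Q_0 = 1$.

```python
from fractions import Fraction
from itertools import product


def continuant(parts):
    """Continuant Q_n(a_1, ..., a_n), built from Q_{-1} = 0 and Q_0 = 1."""
    prev, cur = 0, 1
    for a in parts:
        prev, cur = cur, a * cur + prev
    return cur


def average_continuant(n, digits=(1, 2)):
    """Exact average of Q_n over all sequences of length n drawn from digits."""
    total = sum(continuant(seq) for seq in product(digits, repeat=n))
    return Fraction(total, len(digits) ** n)


def closed_form(n):
    """The value (4/5) 2^n + (1/5) (-2)^(-n) stated for partial quotients <= 2."""
    return Fraction(4, 5) * 2 ** n + Fraction(1, 5) * Fraction(-1, 2) ** n
```

Exact rational arithmetic is used so that the comparison with the closed form is an equality test rather than a floating-point approximation.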
[Fig. A-4. Region of "acceptance" for the algorithm of exercise 6; axes $U$ and $V$.]

SECTION 3.4.1 

1. $\alpha + (\beta - \alpha)U$.

2. Let $U = X/m$; $\lfloor kU \rfloor = r$ iff $r \le kX/m < r + 1$ iff $mr/k \le X < m(r+1)/k$ iff $\lceil mr/k \rceil \le X < \lceil m(r+1)/k \rceil$. The exact probability is given by the formula $(1/m)(\lceil m(r+1)/k \rceil - \lceil mr/k \rceil) = 1/k + \epsilon$, where $|\epsilon| < 1/m$.
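
The ceiling formula of answer 2 can be compared against direct enumeration. This is a small Python sketch under the assumption that $X$ takes each value $0, 1, \ldots, m-1$ equally often; the helper names are hypothetical.

```python
from fractions import Fraction


def ceil_div(a, b):
    """Ceiling of a/b for integers a >= 0 and b > 0."""
    return -(-a // b)


def exact_probability(m, k, r):
    """Pr(floor(k*X/m) == r) from the ceiling formula, X uniform on 0..m-1."""
    return Fraction(ceil_div(m * (r + 1), k) - ceil_div(m * r, k), m)


def brute_force_probability(m, k, r):
    """The same probability obtained by counting the qualifying values of X."""
    hits = sum(1 for x in range(m) if (k * x) // m == r)
    return Fraction(hits, m)
```

For instance, with $m = 10$ and $k = 3$ the value $r = 0$ arises for $X = 0, 1, 2, 3$, so its probability is $4/10$, within $1/m$ of $1/3$.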

3. If full-word random numbers are given, the result will be sufficiently random as in exercise 2. But if a linear congruential sequence is used, $k$ must be relatively prime to the modulus $m$, lest the numbers have a very short period, by the results of Section 3.2.1.1. For example, if $k = 2$ and $m$ is even, the numbers will at best be alternately 0 and 1. The method is slower than (1) in nearly every case, so it is not recommended.

4. $\max(X_1, X_2) \le x$ if and only if $X_1 \le x$ and $X_2 \le x$; $\min(X_1, X_2) \ge x$ if and only if $X_1 \ge x$ and $X_2 \ge x$. The probability that two independent events both happen is the product of the individual probabilities.

5. Obtain independent uniform deviates $U_1$, $U_2$. Set $X \leftarrow U_2$. If $U_1 \ge p$, set $X \leftarrow \max(X, U_3)$, where $U_3$ is a third uniform deviate. If $U_1 \ge p + q$, also set $X \leftarrow \max(X, U_4)$, where $U_4$ is a fourth uniform deviate. This method can obviously be generalized to any polynomial, and indeed even to infinite power series (as shown for example in Algorithm S, which uses minimization instead of maximization).

We could also proceed as follows (suggested by M. D. MacLaren): If $U_1 < p$, set $X \leftarrow U_1/p$; otherwise if $U_1 < p + q$, set $X \leftarrow \max((U_1 - p)/q, U_2)$; otherwise set $X \leftarrow \max((U_1 - p - q)/r, U_2, U_3)$. This method requires less time than the other to obtain the uniform deviates, although it involves further arithmetical operations and it is slightly less stable numerically.
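
MacLaren's variant in answer 5 can be sketched in a few lines of Python. This is an illustration only: it assumes the target distribution is $F(x) = px + qx^2 + rx^3$ with $p + q + r = 1$ and $p, q, r > 0$, and the function name and the `rng` parameter are hypothetical.

```python
import random


def sample_polynomial_mixture(p, q, r, rng=random):
    """Draw X with distribution F(x) = p*x + q*x**2 + r*x**3 on [0, 1).

    The first uniform deviate both selects the branch and, after rescaling,
    serves as one of the deviates inside the maximum.
    """
    u1 = rng.random()
    if u1 < p:
        return u1 / p                                   # uniform: F(x) = x
    if u1 < p + q:
        return max((u1 - p) / q, rng.random())          # max of two: x**2
    return max((u1 - p - q) / r, rng.random(), rng.random())  # max of three
```

Reusing $U_1$ saves one uniform deviate per call, at the cost of the extra subtraction and division mentioned in the answer.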

6. $F(x) = A_1/(A_1 + A_2)$, where $A_1$ and $A_2$ are the areas in Fig. A-4 (bounded by the circle $U^2 + V^2 = 1$); so

$$F(x) = \frac{\int_0^x \sqrt{1 - y^2}\,dy}{\int_0^1 \sqrt{1 - y^2}\,dy} = \frac{2}{\pi} \arcsin x + \frac{2}{\pi}\, x\sqrt{1 - x^2}.$$

The probability of termination at step 2 is $p = \pi/4$, each time step 2 is encountered, so the number of executions of step 2 has the geometric distribution. The characteristics of this number are (min 1, ave $4/\pi$, max $\infty$, dev $(4/\pi)\sqrt{1 - \pi/4}$), by exercise 17.
7. If $k = 1$, then $n_1 = n$ and the problem is trivial. Otherwise it is always possible to find $i \ne j$ such that $n_i \le n \le n_j$. Fill $B_i$ with $n_i$ cubes of color $C_i$ and $n - n_i$ of color $C_j$, then decrease $n_j$ by $n - n_i$ and eliminate color $C_i$. We are left with the same sort of problem but with $k$ reduced by 1; by induction, it's possible.

The following algorithm can be used to compute the $P$ and $Y$ tables: Form a list of pairs $(p_1, 1) \ldots (p_k, k)$ and sort it by first components, obtaining a list $(q_1, a_1) \ldots (q_k, a_k)$ where $q_1 \le \cdots \le q_k$ and $n = k$. Repeat the following operation until $n = 0$: Set $P[a_1 - 1] \leftarrow kq_1$ and $Y[a_1 - 1] \leftarrow x_{a_n}$. Delete $(q_1, a_1)$ and $(q_n, a_n)$, then insert the new entry $(q_n - (1/k - q_1), a_n)$ into its proper place in the list and decrease $n$ by 1.

(If $p_j < 1/k$ the algorithm will never put $x_j$ in the $Y$ table; this fact is used implicitly in Algorithm M. The algorithm attempts to maximize the probability that $V < P_K$ in (3), by always robbing from the richest remaining element and giving it to the poorest. However, it is very difficult to determine the absolute minimum of this probability, since such a task is at least as difficult as the "bin-packing problem"; cf. Chapter 7.)
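
The table-building procedure of answer 7 ("rob the richest, give to the poorest") can be sketched as follows. This Python version is an illustration under stated assumptions: it uses 0-based indices, stores the index of the alias in `Y` rather than the value $x_{a_n}$, works with exact fractions, and stops updating the list once a single entry remains. The function names are hypothetical.

```python
import bisect
from fractions import Fraction


def build_alias_tables(probs):
    """Return (P, Y) for the alias method; probs must sum to 1."""
    k = len(probs)
    entries = sorted((Fraction(p), j) for j, p in enumerate(probs))
    P = [Fraction(0)] * k
    Y = [0] * k
    while entries:
        q1, a1 = entries[0]        # poorest remaining element
        qn, an = entries[-1]       # richest remaining element
        P[a1] = k * q1
        Y[a1] = an
        entries.pop(0)
        if entries:                # richest donates 1/k - q1 to the poorest
            entries.pop()
            bisect.insort(entries, (qn - (Fraction(1, k) - q1), an))
    return P, Y


def implied_distribution(P, Y):
    """Probabilities produced when column j is chosen with probability 1/k,
    then j itself is returned with probability P[j] and Y[j] otherwise."""
    k = len(P)
    dist = [Fraction(0)] * k
    for j in range(k):
        dist[j] += P[j] / k
        dist[Y[j]] += (1 - P[j]) / k
    return dist
```

For example, the probabilities $(\frac12, \frac13, \frac1{12}, \frac1{12})$ give $P = (1, \frac23, \frac13, \frac13)$ and aliases $Y = (0, 0, 0, 1)$ in this 0-based convention.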

8. Replace $P_j$ by $(j + P_j)/k$ for $0 \le j < k$.

9. Consider the sign of $f''(x) = \sqrt{2/\pi}\,(x^2 - 1)e^{-x^2/2}$.


10 . Let Sj = O' - l)/5 for 1 < j < 16 and p j+15 = F{S J+1 ) — F(S 3 ) - p 3 for 
1 < j < 15; also let p 3 i = 1 — ^(3) and p 32 = 0. (Eq. (15) defines pi, ..., pis.) The 
algorithm of exercise 7 can now be used with k = 32 to compute P 3 and Yj, after which 
we will have 1 < Y 3 < 15 for 1 < j < 32. Set P 0 +- P 32 (which is 0) and Y 0 y 32 . 
Then set Z 3 1/(5 — 5 Pj) and Yj •*— fY) — Z 3 for 0 < j < 32; Q 3 <— 1/(5 P 3 ) for 
1 < j < 15. 


Let h — | and f 3 + 15 (x) = 


—i 2 /2 


-F/SO 


)/p 3 +i 5 for Sj < x < Sj + h. 


Then let a 3 — fj+is{S 3 ) for 1 < j < 5, b 3 = f 3 + 15 (S 3 ) for 6 < j < 15; also b 3 = 
— hf 3+15 (Sj + h) for 1 < j < 5, and a 3 = fj+is{xj) + (xj — Sj)bj/h for 6 < j < 15, 
where x 3 is the root of the equation f' j+ 15 (x 3 ) = —b 3 /h. Finally set D 3+l5 a 3 /b 3 

for 1 < j < 15 and E j+ls <- 25 /j for 1 < j < 5, E j+ls <- l/(e (2j - 1)/50 — 1) for 
6 < j < 15. 

Table 1 was computed while making use of the following intermediate values: 
(pi,...,p 3 i) = (.156, .147, .133, .116, .097, .078, .060, .044, .032, .022, .014, .009, .005, 

.003, .002, .002, .005, .007, .009, .010, .009, .009, .008, .006, .005, .004, .002, .002, .001, 

.001, .003); (z6, • • • ,zi 5 ) — (1.115, 1.304, 1.502, 1.700, 1.899, 2.099, 2.298, 2.497, 2.697, 
2.896); (ai,...,ai 5 ) = (7.5, 9.1, 9.5, 9.8, 9.9, 10.0, 10.0, 10.1, 10.1, 10.1, 10.1, 10.2, 

10.2, 10.2, 10.2); b u ...,b ls ) = (14.9, 11.7, 10.9, 10.4, 10.1, 10.1, 10.2, 10.3, 10.4, 10.5, 

10.6, 10.7, 10.7, 10.8, 10.9). 


11. Let $g(t) = e^{9/2}\,t\,e^{-t^2/2}$ for $t \ge 3$. Since $G(x) = \int_3^x g(t)\,dt = 1 - e^{-(x^2 - 9)/2}$, a random variable $X$ with density $g$ can be computed by setting $X \leftarrow G^{-1}(1 - V) = \sqrt{9 - 2\ln V}$. Now $e^{-t^2/2} \le (t/3)e^{-t^2/2}$ for $t \ge 3$, so we obtain a valid rejection method if we accept $X$ with probability $f(X)/cg(X) = 3/X$.
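
The rejection step of answer 11 can be written out directly. The Python sketch below is only an illustration, assuming the goal is a normal deviate conditioned on being at least 3; the function name and `rng` parameter are hypothetical.

```python
import math
import random


def normal_tail_deviate(rng=random):
    """Sample from the normal density restricted to t >= 3 by rejection."""
    while True:
        v = 1.0 - rng.random()                  # V in (0, 1], so log(V) is finite
        x = math.sqrt(9.0 - 2.0 * math.log(v))  # X has density g
        if rng.random() * x <= 3.0:             # accept with probability 3/X
            return x
```

Since $X \ge 3$ always, the acceptance probability $3/X$ never exceeds 1, and it is close to 1 for most candidates because the density $g$ falls off quickly.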


12. We have f(x) = xf(x) —1 < 0 for x > 0, since f(x) = x 1 — e x 2//2 J x °°e f2//2 dt/t 2 
for x > 0. Let x — a 3 —\ and y 2 = x 2 + 2 In 2; then 


V^JT e t212 dt = %y/ 2 / 7 re x * / 2 f{y) < e x 2 / 2 f(x) = 2 \ 


hence y > a 3 . 
13. Take $b_j = \mu_j$; consider now the problem with $\mu_j = 0$ for each $j$. In matrix notation, if $Y = AX$, where $A = (a_{ij})$, we need $AA^T = C = (c_{ij})$. (In other notation, if $Y_j = \sum_k a_{jk}X_k$, then the average value of $Y_iY_j$ is $\sum_k a_{ik}a_{jk}$.) If this matrix equation can be solved for $A$, it can be solved when $A$ is triangular, since $A = BU$ for some orthogonal matrix $U$ and some triangular $B$, and $BB^T = C$. The desired triangular solution can be obtained by solving the equations $a_{11}^2 = c_{11}$, $a_{11}a_{21} = c_{12}$, $a_{21}^2 + a_{22}^2 = c_{22}$, $a_{11}a_{31} = c_{13}$, $a_{21}a_{31} + a_{22}a_{32} = c_{23}$, ..., successively for $a_{11}$, $a_{21}$, $a_{22}$, $a_{31}$, $a_{32}$, etc. [Note: The covariance matrix must be positive semidefinite, since the average value of $(\sum_j y_jY_j)^2$ is $\sum_{i,j} c_{ij}y_iy_j$, which must be nonnegative. And there is always a solution when $C$ is positive semidefinite, since $C = U^{-1}\,\mathrm{diag}(\lambda_1, \ldots, \lambda_n)\,U$, where the eigenvalues $\lambda_j$ are nonnegative, and $U^{-1}\,\mathrm{diag}(\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_n})\,U$ is a solution.]
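
The successive solution of the triangular equations in answer 13 is what is usually called a Cholesky factorization. A minimal Python sketch follows; it assumes $C$ is symmetric and positive definite (so that no diagonal entry vanishes), and the function names are hypothetical.

```python
import math
import random


def triangular_factor(C):
    """Lower triangular A with A A^T = C, solved row by row."""
    n = len(C)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = C[i][j] - sum(A[i][k] * A[j][k] for k in range(j))
            if i == j:
                A[i][i] = math.sqrt(s)      # a_ii^2 = c_ii - sum of squares
            else:
                A[i][j] = s / A[j][j]       # a_jj a_ij = c_ij - cross terms
    return A


def dependent_normals(A, mu, rng=random):
    """Y = mu + A X, where X has independent standard normal components."""
    n = len(A)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [mu[i] + sum(A[i][k] * x[k] for k in range(i + 1)) for i in range(n)]
```

With $C = \begin{pmatrix}4 & 2\\ 2 & 3\end{pmatrix}$ the factor is $a_{11} = 2$, $a_{21} = 1$, $a_{22} = \sqrt2$.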

14. $F(x/c)$ if $c > 0$, a step function if $c = 0$, and $1 - F(x/c)$ if $c < 0$.

15. Distribution $\int_{-\infty}^{\infty} F_1(x - t)\,dF_2(t)$. Density $\int_{-\infty}^{\infty} f_1(x - t)f_2(t)\,dt$. This is called the convolution of the given distributions.

16. It is clear that $f(t) \le cg(t)$ for all $t$ as required. Since $\int_0^\infty g(t)\,dt = 1$ we have $g(t) = Ct^{a-1}$ for $0 \le t < 1$, $Ce^{-t}$ for $t \ge 1$, where $C = ae/(a + e)$. A random variable with density $g$ is easy to obtain as a mixture of two distributions, $G_1(x) = x^a$ for $0 \le x < 1$, and $G_2(x) = 1 - e^{1-x}$ for $x \ge 1$:

G1. [Initialize.] Set $p \leftarrow e/(a + e)$. (This is the probability that $G_1$ should be used.)

G2. [Generate G deviate.] Generate independent uniform deviates $U$, $V$, where $V \ne 0$. If $U < p$, set $X \leftarrow V^{1/a}$ and $q \leftarrow e^{-X}$; otherwise set $X \leftarrow 1 - \ln V$ and $q \leftarrow X^{a-1}$. (Now $X$ has density $g$, and $q = f(X)/cg(X)$.)

G3. [Reject?] Generate a new uniform deviate $U$. If $U \ge q$, return to G2. ∎

The average number of iterations is $c = (a + e)/(e\,\Gamma(a + 1)) < 1.4$.

It is possible to streamline this procedure in several ways. First, we can replace $V$ by an exponential deviate $Y$ of mean 1, generated by Algorithm S, say, and then we set $X \leftarrow e^{-Y/a}$ or $X \leftarrow 1 + Y$ in the two cases. Moreover, if we set $q \leftarrow pe^{-X}$ in the first case and $q \leftarrow p + (1 - p)X^{a-1}$ in the second, we can use the original $U$ instead of a newly generated one in step G3. Finally if $U < p/e$ we can accept $V^{1/a}$ immediately, avoiding the calculation of $q$ about 30 percent of the time.
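
Steps G1 through G3 of answer 16 translate almost line for line into code. The Python sketch below implements only the unstreamlined version and assumes $0 < a < 1$ with target density proportional to $t^{a-1}e^{-t}$; the function name and `rng` parameter are hypothetical.

```python
import math
import random


def gamma_deviate_small_order(a, rng=random):
    """Gamma deviate of order a, 0 < a < 1, by the mixture-rejection steps."""
    p = math.e / (a + math.e)            # G1: probability of using G_1
    while True:
        u = rng.random()                 # G2: choose the piece of g
        v = 1.0 - rng.random()           # V in (0, 1], hence V != 0
        if u < p:
            x = v ** (1.0 / a)
            q = math.exp(-x)
        else:
            x = 1.0 - math.log(v)
            q = x ** (a - 1.0)
        if rng.random() < q:             # G3: reject when the new U >= q
            return x
```

A gamma deviate of order $a$ has mean $a$ and variance $a$, which gives a quick sanity check on a large sample.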

17. (a) $F(x) = 1 - (1 - p)^{\lfloor x \rfloor}$, for $x \ge 0$. (b) $G(z) = pz/(1 - (1 - p)z)$. (c) Mean $1/p$, standard deviation $\sqrt{1 - p}/p$. To do the latter calculation, observe that if $H(z) = q + (1 - q)z$, then $H'(1) = 1 - q$ and $H''(1) + H'(1) - (H'(1))^2 = q(1 - q)$, so the mean and variance of $1/H(z)$ are $q - 1$ and $q(q - 1)$, respectively. (See Section 1.2.10.) In this case, $q = 1/p$; the extra factor $z$ in the numerator of $G(z)$ increases the mean by one.

18. Set $N \leftarrow N_1 + N_2 - 1$, where $N_1$ and $N_2$ independently have the geometric distribution for probability $p$. (Consider the generating function.)

19. Set $N \leftarrow N_1 + \cdots + N_t - t$, where the $N_j$ have the geometric distribution for $p$. (This is the number of failures before the $t$th success, when a sequence of independent trials are made each of which succeeds with probability $p$.)

For $t = p = \frac12$, and in general when the mean value (namely $t(1 - p)/p$) of the distribution is small, we can simply evaluate the probabilities $p_n = \binom{t-1+n}{n} p^t (1 - p)^n$ consecutively for $n = 0, 1, 2, \ldots$ as in the following algorithm:

B1. [Initialize.] Set $N \leftarrow 0$, $q \leftarrow p^t$, $r \leftarrow q$, and generate a random uniform deviate $U$. (We will have $q = p_N$ and $r = p_0 + \cdots + p_N$ during this algorithm, which stops as soon as $U < r$.)

B2. [Search.] If $U \ge r$, set $N \leftarrow N + 1$, $q \leftarrow q(1 - p)(t - 1 + N)/N$, $r \leftarrow r + q$, and repeat this step. ∎

[An interesting technique for the negative binomial distribution, for arbitrarily large real values of $t$, has been suggested by R. Leger: First generate a random gamma deviate $X$ of order $t$, then let $N$ be a random Poisson deviate of mean $X(1 - p)/p$.]
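
The sequential search B1, B2 of answer 19 is short enough to sketch in Python. This is an illustration, not a tuned implementation: the function name and `rng` parameter are hypothetical, and the extra test `q > 0.0` is an added floating-point guard (not part of the stated steps) that stops the loop if the running term underflows.

```python
import random


def negative_binomial_deviate(t, p, rng=random):
    """Negative binomial deviate for real t > 0 and 0 < p <= 1."""
    n = 0
    q = p ** t                 # B1: q = p_0
    r = q                      #     r = p_0 + ... + p_N
    u = rng.random()
    while u >= r and q > 0.0:  # B2: advance until U < r
        n += 1
        q *= (1.0 - p) * (t - 1 + n) / n
        r += q
    return n
```

The update of `q` uses the ratio $p_N/p_{N-1} = (1 - p)(t - 1 + N)/N$, so no binomial coefficient is ever computed explicitly.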

20. R 1 = 1 -f- (1 — A/R) • Rl. When R2 is performed, the algorithm terminates with 
probability I/R; when R3 is performed, it goes to Rl with probability E/R. We have 


Rl 

R/A 

R/A 

R/A 

R/A 

R2 

0 

R/A 

0 

R/A 

R3 

0 

0 

R/A 

R/A—I/A 

R4 

R/A 

R/A-I/A 

R/A — E/A 

& 

1 

1 


21. $R = \sqrt{8/e} \approx 1.71553$; $A = \sqrt{\pi/2} \approx 1.25331$. Since

J uVa — budu = (a — bu) 3/2 (%(a — bu) — §)/b 2 , 

we have I = f“ /b uy/a — bu du = ^a 5 / 2 / 6 2 where a = 4(1 -)- Inc) and b = 4c; when 

c = e 1/4 ,1 has its maximum value |\/5/e 1.13020. Finally the following integration 
formulas are needed for E: 

J y/bu — au 2 du = |& 2 a~ 3 / 2 arcsin( 2 'ua /6 — 1 ) -j- \ba~ 1 \/bu — au 2 (2 ua/b — 1 ), 

bu au 2 du = — %b 2 a~ 3 ^ 2 In Jbu -j- au 2 -j- Uy/a -f- b/2y/a/) 

4 \ba~ 1 y/bu 4 - au 2 (2 ua/b 4 - 1 ), 

where a, b > 0. Let the test in step R3 be “ X 2 > 4 e x ~' 1 /U — 4x”; then the 
exterior region hits the top of the rectangle when u = r{x) = ( e x — yje 2x — 2 ex)/ 2 ex. 
(Incidentally, r(x) reaches its maximum value at x — 1/2, a point where it is not 

differentiable!) We have E = $^ x \\/2je — y/bu — au 2 ) du where b = 4e I—1 and 
a = 4x. The maximum value of E occurs near x = —.35, where we have E « .29410. 

22. (Solution by G. Marsaglia.) Consider the "continuous Poisson distribution" defined by $G(x) = \int_\mu^\infty e^{-t}t^{x-1}\,dt/\Gamma(x)$, for $x > 0$; if $X$ has this distribution then $\lfloor X \rfloor$ is Poisson distributed, since $G(x + 1) - G(x) = e^{-\mu}\mu^x/x!$. If $\mu$ is large, $G$ is approximately normal, hence $G^{-1}(F_\mu(x))$ is approximately linear, where $F_\mu(x)$ is the distribution function for a normal deviate with mean and variance $\mu$; i.e., $F_\mu(x) = F((x - \mu)/\sqrt\mu)$, where $F(x)$ is the normal distribution function (10). Let $g(x)$ be an efficiently computable function such that $|G^{-1}(F_\mu(x)) - g(x)| < \epsilon$ for $-\infty < x < \infty$; we can now generate Poisson deviates efficiently as follows: Generate a normal deviate $X$, and set $Y \leftarrow g(\mu + \sqrt\mu\,X)$, $N \leftarrow \lfloor Y \rfloor$, $M \leftarrow \lfloor Y + \frac12 \rfloor$. If $|Y - M| > \epsilon$, output $N$; otherwise output $M - 1$ or $M$, according as $G^{-1}(F(X)) < M$ or not.

This approach applies also to the binomial distribution, with

G(x) = f + 

J p 

since $\lfloor G^{-1}(U) \rfloor$ is binomial with parameters $(t, p)$ and $G$ is approximately normal.

[See also the alternative method proposed by Ahrens and Dieter in Computing 
(1980), to appear.] 

23. Yes. The second method calculates $|\cos 2\theta|$, where $\theta$ is uniformly distributed between 0 and $\pi/2$. (Let $U = r\cos\theta$, $V = r\sin\theta$.)

25. $\frac{21}{32} = (.10101)_2$. In general, the binary representation is formed by using 1 for $\lor$ and 0 for $\land$, from left to right, then suffixing 1. This technique [cf. K. D. Tocher, J. Roy. Stat. Soc. B-16 (1954), 49] can lead to efficient generation of independent bits having a given probability $p$, and it can also be applied to the geometric and binomial distributions.
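
The rule of answer 25 can be checked by enumerating all bit patterns. The Python sketch below is an illustration that assumes the expression in question has the nested form $X_1 \lor (X_2 \land (X_3 \lor (X_4 \land X_5)))$ with independent fair bits; that form is an assumption here (the exercise statement is not reproduced in these answers), and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def probability_by_enumeration():
    """Pr that x1 | (x2 & (x3 | (x4 & x5))) equals 1 for fair random bits."""
    hits = sum(x1 | (x2 & (x3 | (x4 & x5)))
               for x1, x2, x3, x4, x5 in product((0, 1), repeat=5))
    return Fraction(hits, 32)


def probability_from_operators(ops):
    """Binary fraction with 1 for each 'or', 0 for each 'and', then a final 1."""
    bits = ''.join('1' if op == 'or' else '0' for op in ops) + '1'
    return Fraction(int(bits, 2), 2 ** len(bits))
```

Reading the operators left to right as or, and, or, and gives the bits 1010, and the suffixed 1 yields $(.10101)_2$.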

26. (a) True, $\sum_k \Pr(N_1 = k)\Pr(N_2 = n - k) = e^{-\mu_1 - \mu_2}(\mu_1 + \mu_2)^n/n!$. (b) False, unless $\mu_2 = 0$; otherwise $N_1 - N_2$ might be negative.

27. Let the binary representation of $p$ be $(.b_1b_2b_3\ldots)_2$, and proceed according to the following rules:

B1. [Initialize.] Set $m \leftarrow t$, $N \leftarrow 0$, $j \leftarrow 1$. (During this algorithm, $m$ represents the number of simulated uniform deviates whose relation to $p$ is still unknown, since they match $p$ in their leading $j - 1$ bits; and $N$ is the number of simulated deviates known to be less than $p$.)

B2. [Look at next column of bits.] Generate a random integer $M$ with the binomial distribution $(m, \frac12)$. (Now $M$ represents the number of unknown deviates that fail to match $b_j$.) Set $m \leftarrow m - M$, and if $b_j = 1$ set $N \leftarrow N + M$.

B3. [Done?] If $m = 0$, or if the remaining bits $(.b_{j+1}b_{j+2}\ldots)_2$ of $p$ are all zero, the algorithm terminates. Otherwise, set $j \leftarrow j + 1$ and return to step B2. ∎

[When $b_j = 1$ for infinitely many $j$, the average number of iterations $A_t$ satisfies

$$A_0 = 0; \qquad A_n = 1 + \frac{1}{2^n}\sum_k \binom{n}{k} A_k \quad \text{for } n \ge 1.$$

Letting $A(z) = \sum A_n z^n/n!$, we have $A(z) = e^z - 1 + A(\frac12 z)e^{z/2}$. Therefore $A(z)e^{-z} = 1 - e^{-z} + A(\frac12 z)e^{-z/2} = \sum_{k \ge 0}(1 - e^{-z/2^k}) = 1 - e^{-z} - \sum_{n \ge 1}(-z)^n/(n!\,(2^n - 1))$,

$$A_n = 1 + \sum_{k \ge 1} \binom{n}{k}\frac{(-1)^{k+1}}{2^k - 1} = 1 + \frac{V_{n+1}}{n + 1} = \lg n + \frac{\gamma}{\ln 2} + \frac12 + f_0(n) + O(n^{-1})$$

in the notation of exercise 5.2.2-48.]
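
The bit-column procedure B1 through B3 of answer 27 can be sketched as follows. This Python version is an illustration under stated assumptions: $p$ is held as an exact fraction with $0 \le p < 1$ so that its binary digits can be peeled off one at a time, and the binomial $(m, \frac12)$ deviate is obtained naively by counting $m$ random bits. The function names are hypothetical.

```python
import random
from fractions import Fraction


def binomial_half(m, rng):
    """Binomial (m, 1/2) deviate: the number of ones among m random bits."""
    return sum(rng.getrandbits(1) for _ in range(m))


def binomial_by_bit_columns(t, p, rng=random):
    """Binomial (t, p) deviate using only binomial (m, 1/2) deviates."""
    rest = Fraction(p)         # remaining bits (.b_j b_{j+1} ...) of p
    m, n = t, 0                # B1: m undecided deviates, n known to be < p
    while m > 0 and rest > 0:  # B3: stop when nothing is left to decide
        rest *= 2
        b = 1 if rest >= 1 else 0      # current bit b_j
        rest -= b
        mismatched = binomial_half(m, rng)   # B2: deviates that differ from b_j
        m -= mismatched
        if b == 1:
            n += mismatched    # their bit is 0 where p has 1, so they are < p
    return n
```

When $b_j = 0$ the mismatched deviates exceed $p$ and are simply dropped, which is why only the case $b_j = 1$ adds to the count.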

28 . Generate a random point (yi,. .., y n ) on the unit sphere, and let p — ak Vk- 
Generate an independent uniform deviate U, and if p n+1 U < Ky/^a^yl, output the 
point ( 7 / 1 /p,..., y n /p ); otherwise start over. Here K 2 = min{ (£ a k y 2 k ) n+1 /{J2 alyl) I 
= 1} — a n ~ 1 if na n > ai, ((n -f- l)/(ai -(-a n )) n+1 (oia n /n) n otherwise. 
29. Let $X_{n+1} = 1$, then set $X_k \leftarrow X_{k+1}U_k^{1/k}$ or $X_k \leftarrow X_{k+1}e^{-Y_k/k}$ for $k = n$, $n - 1$, ..., 1, where $U_k$ is uniform or $Y_k$ is exponential. [ACM Trans. Math. Software, to appear.]
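
The recurrence of answer 29 can be run directly. The Python sketch below assumes, as the recurrence suggests, that the aim is to produce $n$ uniform deviates already in nondecreasing order without sorting; that reading, the function name, and the `rng` parameter are assumptions made for illustration.

```python
import random


def sorted_uniform_deviates(n, rng=random):
    """Return X_1 <= ... <= X_n via X_k = X_{k+1} * U_k**(1/k), X_{n+1} = 1."""
    x = [0.0] * (n + 2)        # x[1..n] are used; x[n+1] is the sentinel
    x[n + 1] = 1.0
    for k in range(n, 0, -1):
        u = 1.0 - rng.random()             # uniform deviate in (0, 1]
        x[k] = x[k + 1] * u ** (1.0 / k)
    return x[1:n + 1]
```

The largest value is produced first: $U_n^{1/n}$ has the distribution of the maximum of $n$ independent uniform deviates, and each later step rescales the maximum of the remaining ones.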


SECTION 3.4.2 


1. There are $\binom{N-t}{n-m}$ ways to pick $n - m$ records from the last $N - t$; $\binom{N-t-1}{n-m-1}$ ways to pick $n - m - 1$ from $N - t - 1$ after selecting the $(t + 1)$st item.

2. Step S3 will never go to step S5 when the number of records left to be examined 
is equal to n — m. 

3. We should not confuse “conditional” and “unconditional” probabilities. The 
quantity m depends randomly on the selections that took place among the first t 
elements; if we take the average over all possible choices that could have occurred 
among these elements, we will find that (n — m)/(N — t) is exactly n/N on the 
average. For example, consider the second element; if the first element was selected in 
the sample (this happens with probability n/N), the second element is selected with 
probability (n — l)/(N — 1); if the first element was not selected, the second is selected 
with probability n/(N — 1). The overall probability of selecting the second element is 
$(n/N)((n - 1)/(N - 1)) + (1 - n/N)(n/(N - 1)) = n/N$.
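
The averaging argument of answer 3 extends to every position, and it can be verified exactly by tracking the distribution of $m$. The Python sketch below is an illustration with hypothetical names; it assumes each record is selected with conditional probability $(n - m)/(N - t)$ when $m$ of the first $t$ records have been chosen.

```python
from fractions import Fraction


def selection_probabilities(n, N):
    """Unconditional probability that each of the N records is selected."""
    dist = {0: Fraction(1)}    # dist[m] = Pr(m of the first t were selected)
    result = []
    for t in range(N):
        overall = Fraction(0)
        new_dist = {}
        for m, pr in dist.items():
            pick = Fraction(n - m, N - t)      # conditional probability
            overall += pr * pick
            if pick > 0:
                new_dist[m + 1] = new_dist.get(m + 1, Fraction(0)) + pr * pick
            if pick < 1:
                new_dist[m] = new_dist.get(m, Fraction(0)) + pr * (1 - pick)
        result.append(overall)
        dist = new_dist
    return result
```

Every entry of the returned list equals $n/N$, even though the conditional probabilities vary from step to step.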

4. From the algorithm,

$$p(m, t + 1) = \Bigl(1 - \frac{n - m}{N - t}\Bigr)p(m, t) + \frac{n - (m - 1)}{N - t}\,p(m - 1, t).$$

The desired formula can be proved by induction on $t$. In particular, $p(n, N) = 1$.

5. In the notation of exercise 4, the probability that $t = k$ at termination is $q_k = p(n, k) - p(n, k - 1) = \binom{k-1}{n-1}\big/\binom{N}{n}$. The average is $\sum_{0 \le k \le N} kq_k = (N + 1)n/(n + 1)$.

6. Similarly, $\sum_{0 \le k \le N} k(k + 1)q_k = (N + 2)(N + 1)n/(n + 2)$; the variance is therefore $(N + 1)(N - n)n/(n + 2)(n + 1)^2$.

7. Suppose the choice is $1 \le x_1 < x_2 < \cdots < x_n \le N$. Let $x_0 = 0$, $x_{n+1} = N + 1$. The choice is obtained with probability $p = \prod_{1 \le t \le N} p_t$, where

$$p_t = \frac{N - (t - 1) - n + m}{N - (t - 1)} \quad \text{for } x_m < t < x_{m+1};$$

$$p_t = \frac{n - m}{N - (t - 1)} \quad \text{for } t = x_{m+1}.$$

The denominator of the product $p$ is $N!$; the numerator contains the terms $N - n$, $N - n - 1$, ..., 1 for those $t$'s that are not $x$'s, and the terms $n$, $n - 1$, ..., 1 for those $t$'s that are $x$'s. Hence $p = (N - n)!\,n!/N!$. Example: $n = 3$, $N = 8$, $(x_1, x_2, x_3) = (2, 3, 7)$; $p = \frac58 \cdot \frac37 \cdot \frac26 \cdot \frac45 \cdot \frac34 \cdot \frac23 \cdot \frac12 \cdot \frac11$.

9. The reservoir gets seven records: 1, 2, 3, 5, 9, 13, 16. The final sample consists of 
records 2, 5, 16. 

10. Delete step R6 and the variable $m$. Replace the $I$ table by a table of records, initialized to the first $n$ records in step R1, and with the new record replacing the $M$th table entry in step R4.
11. Arguing as in Section 1.2.10, which considers the special case $n = 1$, we see that the generating function is

$$G(z) = z^n \Bigl(\frac{1}{n+1} + \frac{n}{n+1}z\Bigr)\Bigl(\frac{2}{n+2} + \frac{n}{n+2}z\Bigr) \cdots \Bigl(\frac{N-n}{N} + \frac{n}{N}z\Bigr).$$

The mean is $n + \sum_{n < t \le N}(n/t) = n(1 + H_N - H_n)$; and the variance turns out to be $n(H_N - H_n) - n^2(H_N^{(2)} - H_n^{(2)})$.

12. (Note that 1 = ( btt)... (&33)(£>22), so we seek an algorithm that goes from the 
representation of it to that for tt~ U) Set b 3 <— j for 1 < j< t. Then for j = 2, 3, 
..., t (in this order) interchange bj +-+ b aj . Finally for j = t, ..., 3, 2 (in this order) 
set ba 3 <— bj. (The algorithm is based on the fact that (a t t) tt\ = 7Ti (btt).) 

13. Renumbering the deck 0, 1, ..., $2n - 2$, we find that $s$ takes card number $x$ into card number $(2x) \bmod (2n - 1)$, while $c$ takes card $x$ into $(x + 1) \bmod (2n - 1)$. We have ($c$ followed by $s$) $= cs = sc^2$. Therefore any product of $c$'s and $s$'s can be transformed into the form $s^ic^k$. Also $2^{\varphi(2n-1)} \equiv 1$ (modulo $(2n - 1)$); since $s^{\varphi(2n-1)}$ and $c^{2n-1}$ are the identity permutation, at most $(2n - 1)\varphi(2n - 1)$ arrangements are possible. (The exact number of different arrangements is $(2n - 1)k$, where $k$ is the order of 2 modulo $(2n - 1)$. For if $s^k = c^j$, then $c^j$ fixes the card 0, so $s^k = c^j = $ identity.) For further details, see SIAM Review 3 (1961), 293-297.
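
The count $(2n - 1)k$ in answer 13 can be confirmed for small decks by generating every arrangement reachable with the two moves. The Python sketch below is an illustration with hypothetical names; it assumes $n \ge 2$ and models the moves as $x \mapsto 2x$ and $x \mapsto x + 1$ modulo $2n - 1$.

```python
def count_arrangements(n):
    """Number of distinct arrangements generated by the moves s and c."""
    size = 2 * n - 1
    s = tuple((2 * x) % size for x in range(size))
    c = tuple((x + 1) % size for x in range(size))
    start = tuple(range(size))
    seen = {start}
    frontier = [start]
    while frontier:
        perm = frontier.pop()
        for move in (s, c):
            nxt = tuple(move[perm[x]] for x in range(size))  # perm, then move
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return len(seen)


def order_of_two(modulus):
    """Smallest k >= 1 with 2**k == 1 (mod modulus); modulus odd and >= 3."""
    k, power = 1, 2 % modulus
    while power != 1:
        power = (power * 2) % modulus
        k += 1
    return k
```

For $n = 3$ the deck has 5 cards, the order of 2 modulo 5 is 4, and 20 arrangements are reachable out of the $5! = 120$ possible ones.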

14. Set $Y_j \leftarrow j$ for $t - n < j \le t$. Then for $j = t$, $t - 1$, ..., $t - n + 1$ do the following operations: Set $k \leftarrow \lfloor jU \rfloor + 1$. If $k > t - n$ then set $X_j \leftarrow Y_k$ and $Y_k \leftarrow Y_j$; otherwise if $k = X_i$ for some $i > j$ (a symbol table algorithm could be used), then set $X_j \leftarrow Y_i$ and $Y_i \leftarrow Y_j$; otherwise set $X_j \leftarrow k$. (The idea is to let $Y_{t-n+1}$, ..., $Y_j$ represent $X_{t-n+1}$, ..., $X_j$, and if $i > j$ and $X_i \le t - n$ also to let $Y_i$ represent $X_{X_i}$, in the execution of Algorithm P.)

15. We may assume that $n \le \frac12 N$, otherwise it suffices to find the $N - n$ elements not in the sample. Using a hash table of size $2n$, the idea is to generate random numbers between 1 and $N$, storing them in the table and discarding duplicates, until $n$ distinct numbers have been generated. The average number of random numbers generated is $N/N + N/(N - 1) + \cdots + N/(N - n + 1) < 2n$, by exercise 3.3.2-10, and the average

time to process each number is 0(1). We want to output the results in increasing 
order, and this can be done as follows: Using an ordered hash table (exercise 6.4-66) 
with linear probing, the hash table will appear as if the values had been inserted in 
increasing order. Thus if we use a monotonic hash address such as \nk/N) for the 
key k, it will be a simple matter to output the keys in sorted order by making at most 
two passes over the table. 

16. In most cases it will be best simply to select $n$ records one at a time by Walker's alias method, with probabilities proportional to their weights, rejecting an element that occurs more than once. But if some weights are significantly larger than others, the following algorithm will sometimes be better [cf. Wong and Easton, SIAM J. Comput. 9 (1980), 111-113]. Begin with a "complete binary tree" of weights $x_i$, where $x_i = x_{2i} + x_{2i+1}$ for $1 \le i < N$ and $x_{N+i-1} = w_i$ for $1 \le i \le N$; then do the following operation $n$ times: "Set $j \leftarrow 1$, and generate $X = Ux_1$. If $X < x_{2j}$, set $j \leftarrow 2j$, otherwise set $X \leftarrow X - x_{2j}$ and $j \leftarrow 2j + 1$; repeat this until $j \ge N$. Then select element $j - N + 1$, set $i \leftarrow \lfloor j/2 \rfloor$, and while $i \ge 1$ set $x_i \leftarrow x_i - x_j$ and $i \leftarrow \lfloor i/2 \rfloor$. Finally set $x_j \leftarrow 0$."
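
The tree-based selection described in answer 16 can be sketched in Python as follows. This is an illustration, not the authors' code: the function name and `rng` parameter are hypothetical, the check on the number of positive weights is an added safeguard, and integer weights are assumed so that the subtractions in the tree stay exact (with floating-point weights, rounding could in rare cases steer the walk toward an already selected leaf).

```python
import random


def weighted_sample_without_replacement(weights, n, rng=random):
    """Select n distinct element numbers (1-based), each draw made with
    probability proportional to the weights of the elements still present."""
    N = len(weights)
    if n > sum(1 for w in weights if w > 0):
        raise ValueError("not enough elements with positive weight")
    x = [0] * (2 * N)                  # x[1..2N-1]; x[0] is unused
    for i in range(1, N + 1):
        x[N + i - 1] = weights[i - 1]  # leaves hold the weights
    for i in range(N - 1, 0, -1):
        x[i] = x[2 * i] + x[2 * i + 1]  # internal nodes hold subtree sums
    chosen = []
    for _ in range(n):
        j = 1
        X = rng.random() * x[1]
        while j < N:                   # walk down to a leaf
            if X < x[2 * j]:
                j = 2 * j
            else:
                X -= x[2 * j]
                j = 2 * j + 1
        chosen.append(j - N + 1)
        i = j // 2
        while i >= 1:                  # remove the leaf's weight from ancestors
            x[i] -= x[j]
            i //= 2
        x[j] = 0
    return chosen
```

Each selection costs time proportional to the height of the tree, about $\lg N$ steps, whereas rejection of duplicates can become slow when a few weights dominate.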





556 ANSWERS TO EXERCISES 


3.5 


SECTION 3.5 

1. A $b$-ary sequence, yes (cf. exercise 2); a [0,1) sequence, no (since only finitely many
values are assumed by the elements). 

2. It is 1-distributed and 2-distributed, but not 3-distributed (the binary number 111 
never appears). 

3. Cf. exercise 3.2.2-17; repeat the sequence there with a period of length 27. 

4. The sequence begins $, j, |, J, J, i, f, f, §, f, §, §, §, etc. When 
$n = 1, 3, 7, 15, \ldots$ we have $\nu(n) = 1, 1, 5, 5, \ldots$ so that $\nu(2^{2k-1} - 1) = \nu(2^{2k} - 1) = (2^{2k} - 1)/3$; hence $\nu(n)/n$ oscillates between $\frac13$ and approximately $\frac23$, and no limit exists. The probability is undefined.

[The methods of Section 4.2.4 show, however, that a numerical value can meaningfully be assigned to

$\Pr(U_n < \frac12) = \Pr(\text{leading digit of the radix-4 representation of } n + 1 \text{ is } 1)$,
namely $\log_4 2 = \frac12$.]

5. If $\nu_1(n)$, $\nu_2(n)$, $\nu_3(n)$, $\nu_4(n)$ are the counts corresponding to the four probabilities, we have $\nu_1(n) + \nu_2(n) = \nu_3(n) + \nu_4(n)$ for all $n$. So the desired result follows by addition of limits.

6. By exercise 5 and induction,

$$\Pr(S_j(n) \text{ for some } j,\ 1 \le j \le k) = \sum_{1 \le j \le k} \Pr(S_j(n)).$$

As $k \to \infty$, the latter is a monotone sequence bounded by 1, so it converges; and

$$\Pr(S_j(n) \text{ for some } j \ge 1) \ge \sum_{1 \le j \le k} \Pr(S_j(n))$$

for all $k$. For a counterexample to equality, it is not hard to arrange things so that $S_j(n)$ is always true for some $j$, yet $\Pr(S_j(n)) = 0$ for all $j$.

7. Let pi = Yl-> i Pr (Sij(n)). The result of the preceding exercise can be generalized 
to Pr(£j(n) for some j > l) > ^ ,> 1 Pr(Sj-(n)), for any disjoint statements Sj(n). 
So we have 1 = Pr(St 5 (n) for some i,j > l) > TV> x Pr(&j(n) for some j > l) > 

Pi = 1, and hence Pr(S,:,(n) for some j > l) — p t . Given e > 0, let I be large 
enough so that > 1 — e. Let 

0i(iV) = (number of n < iV with 5 XJ (n) true for some j > l)/iV. 

Clearly Ei< i<7 0i(N) < 1, and for all large enough AT we have X] 2 <i<i^ l ( N ) > 

E 2 <i hence W < 1 — 0a(N)-0/(iV) < 1— p 2 -+ e < 

1 — (1 — e — pi) -(- £ = pi + 2e. This proves that Pr(Sij(n) for some j > l) < pi -f- 2e; 
hence Pr(S , i J (n) for some j > l) = pi, and the desired result holds for i — 1. By 
symmetry of the hypotheses, it holds for any value of i. 

8. Add together the probabilities for j, j -f- d, j + 2d, ... in Definition E. 





9. $\limsup_{n \to \infty}(a_n + b_n) \le \limsup_{n \to \infty} a_n + \limsup_{n \to \infty} b_n$; hence we find that

$$\limsup_{n \to \infty}\bigl((y_{1n} - \alpha)^2 + \cdots + (y_{mn} - \alpha)^2\bigr) \le m\alpha^2 - 2m\alpha^2 + m\alpha^2 = 0,$$

and this can happen only if each $(y_{jn} - \alpha)$ tends to zero.

10. In the evaluation of the sum in Eq. (22).

11. $\langle U_{2n} \rangle$ is $k$-distributed if $\langle U_n \rangle$ is $(2, 2k)$-distributed.

12. Let $f(x_1, \ldots, x_k) = 1$ if $u \le \max(x_1, \ldots, x_k) < v$; $f(x_1, \ldots, x_k) = 0$ otherwise. Then apply Theorem B.

13. Let

$$p_k = \Pr(U_n \text{ begins a gap of length } k - 1)$$
$$= \Pr\bigl(U_{n-1} \in [\alpha, \beta),\ U_n \notin [\alpha, \beta),\ \ldots,\ U_{n+k-2} \notin [\alpha, \beta),\ U_{n+k-1} \in [\alpha, \beta)\bigr)$$
$$= p^2(1 - p)^{k-1}.$$

It remains to translate this into the probability that $f(n) - f(n - 1) = k$. Let $\nu_k(n)$ = (number of $j \le n$ with $f(j) - f(j - 1) = k$); let $\mu_k(n)$ = (number of $j \le n$ with $U_j$ the beginning of a gap of length $k - 1$); and let $\mu(n)$ similarly count the number of $1 \le j \le n$ with $U_j \in [\alpha, \beta)$. We have $\mu_k(f(n)) = \nu_k(n)$, $\mu(f(n)) = n$. As $n \to \infty$, we must have $f(n) \to \infty$, hence

$$\nu_k(n)/n = \bigl(\mu_k(f(n))/f(n)\bigr) \cdot \bigl(f(n)/\mu(f(n))\bigr) \to p_k/p = p(1 - p)^{k-1}.$$

[We have only made use of the fact that the sequence is $(k + 1)$-distributed.]

14. Let 

p k z= Pt(Uti begins a run of length k ) 

= Pr(C/ n -l > U n < • • • < U n -\-k — 1 > f/n+fc) 

- srrw (C t X t *) - (* t “) - C10 + * 

k k + 1 

= (fc + !)• “ (k + 2)! 

(cf. exercise 3.3.2-13). Now proceed as in the previous exercise to transfer this to 
Pr(/(n) — }(n — 1) — k). [We have assumed only that the sequence is (k -\- 2)- 
distributed.] 

15. For s,t > 0 let 


Pr(X n - 


-2t — 3 


X n — 2 1 — 2 7 ^ Xn — 2 1 —1 7 ^ 
and -Xn — ' *" 


7 ^ Xn—l 

Xn-\-3 ^ ^Gi+s+l) 


x—s — 2 1 —3 . 


for t > 0 let q t = Pr(X n — 2 t —2 = X n - 2 t-i ^ • * • 7 ^ X n -i) — 2 2t *. By exercise 7, 
Pr(X n is not the beginning of a coupon set) = X^>o Q* = §> Pr(X n is the beginning 

of coupon set of length s-)-2) — ^ > 0 p st = J Now proceed as in exercise 13. 
16. (Solution by R. P. Stanley.) Whenever the subsequence $S = (b - 1), (b - 2), \ldots, 1, 0, 0, 1, \ldots, (b - 2), (b - 1)$ appears, a coupon set must end at the right of $S$, since some coupon set is completed in the first half of $S$. We now proceed to calculate the probability that a coupon set begins at position $n$ by manipulating the probabilities that the last prior appearance of $S$ ends at position $n - 1$, $n - 2$, etc., as in exercise 15.

18. Proceed as in the proof of Theorem A to calculate $\underline{\Pr}$ and $\overline{\Pr}$.

19. (Solution by T. Herzog.) Yes; e.g., the sequence $\langle U_{\lfloor n/2\rfloor}\rangle$ when $\langle U_n\rangle$ satisfies R4
(or even its weaker version), cf. exercise 33.

21. $\Pr(Z_n \in M_1, \ldots, Z_{n+k-1} \in M_k) = p(M_1)\ldots p(M_k)$, for all $M_1, \ldots, M_k \in \mathcal{M}$.

22. If the sequence is k-distributed, the limit is zero by integration and Theorem B.
Conversely, note that if $f(x_1,\ldots,x_k)$ has an absolutely convergent Fourier series

$$f(x_1,\ldots,x_k) = \sum_{-\infty<c_1,\ldots,c_k<\infty} a(c_1,\ldots,c_k)\exp\bigl(2\pi i(c_1x_1+\cdots+c_kx_k)\bigr),$$

we have $\lim_{N\to\infty}\frac1N\sum_{0\le n<N} f(U_n,\ldots,U_{n+k-1}) = a(0,\ldots,0) + \epsilon_r$, where

$$|\epsilon_r| \le \sum_{\max\{|c_1|,\ldots,|c_k|\}>r} |a(c_1,\ldots,c_k)|,$$

so $\epsilon_r$ can be made arbitrarily small. Hence this limit is equal to

$$a(0,\ldots,0) = \int_0^1\!\cdots\!\int_0^1 f(x_1,\ldots,x_k)\,dx_1\ldots dx_k,$$

and Eq. (8) holds for all sufficiently smooth functions f. The remainder of the proof
shows that the function in (9) can be approximated by smooth functions to any desired
accuracy.

23. See AMM 75 (1968), 260-264. 

24. This follows immediately from exercise 22. 

25. If the sequence is equidistributed, the denominator in Corollary S approaches $\frac1{12}$,
and the numerator approaches the quantity in this exercise.

26. See Math. Comp. 17 (1963), 50-54. [Consider also the following example by A. G.
Waterman: Let $\langle U_n\rangle$ be an equidistributed [0,1) sequence and $\langle X_n\rangle$ an ∞-distributed
binary sequence. Let $V_n = U_{\lfloor\sqrt n\rfloor}$ or $1 - U_{\lfloor\sqrt n\rfloor}$ according as $X_n$ is 0 or 1. Then $\langle V_n\rangle$ is
equidistributed and white, but $\Pr(V_n = V_{n+1}) = \frac12$. Let $W_n = (V_n - \epsilon_n) \bmod 1$ where
$\langle\epsilon_n\rangle$ is any sequence that decreases monotonically to 0; then $\langle W_n\rangle$ is equidistributed
and white, yet $\Pr(W_n < W_{n+1}) = \frac34$.]

28. Let $\langle U_n\rangle$ be ∞-distributed, and consider the sequence $\langle\frac12(X_n + U_n)\rangle$. This is
3-distributed, using the fact that $\langle U_n\rangle$ is (16,3)-distributed.
29. If $x = x_1x_2\ldots x_t$ is any binary number, we can consider the number $\nu_x^E(n)$ of
times $X_p\ldots X_{p+t-1} = x$, where $1 \le p \le n$ and p is even. Similarly, let $\nu_x^O(n)$ count
the number of times when p is odd. Let $\nu_x^E(n) + \nu_x^O(n) = \nu_x(n)$. Now

$$\nu_0^E(n) = \sum \nu^E_{0**\ldots*}(n) \approx \sum \nu^O_{*0*\ldots*}(n) \approx \sum \nu^E_{**0\ldots*}(n) \approx \cdots \approx \sum \nu^O_{***\ldots0}(n)$$

where the ν's in these summations have 2k subscripts, 2k − 1 of which are asterisks
(meaning that they are being summed over; each sum is taken over $2^{2k-1}$ combinations
of zeros and ones), and where "≈" denotes approximate equality (except for an error
of at most 2k due to end conditions). Therefore we find that

$$\frac1n\,2k\,\nu_0^E(n) = \frac1n\Bigl(\sum \nu_{*0*\ldots*}(n) + \cdots + \sum \nu_{***\ldots0}(n)\Bigr)
 + \frac1n\sum_x \bigl(r(x) - s(x)\bigr)\,\nu_x^E(n) + O\Bigl(\frac1n\Bigr),$$

where $x = x_1\ldots x_{2k}$ contains r(x) zeros in odd positions and s(x) zeros in even positions.
By (2k)-distribution, the parenthesized quantity tends to $k(2^{2k-1})/2^{2k} = k/2$. The
remaining sum is clearly a maximum if $\nu_x^E(n) = \nu_x(n)$ when r(x) > s(x), and $\nu_x^E(n) = 0$
when r(x) < s(x). So the maximum of the right-hand side becomes

$$\frac k2 + \sum_{0\le s<r\le k} (r-s)\binom kr\binom ks\Big/2^{2k} = \frac k2 + k\binom{2k-1}{k}\Big/2^{2k}.$$

Now $\overline{\Pr}(X_{2n} = 0) \le \limsup_{n\to\infty} \nu_0^E(2n)/n$, so the proof is complete. Note:

$$\sum_{r,s}\binom nr\binom ns \max(r,s) = 2n\,2^{2n-2} + n\binom{2n-1}{n};$$

$$\sum_{r,s}\binom nr\binom ns \min(r,s) = 2n\,2^{2n-2} - n\binom{2n-1}{n}.$$

30. Let $f(x_1, x_2, \ldots, x_{2k}) = \operatorname{sign}(x_1 - x_2 + x_3 - x_4 + \cdots - x_{2k})$. Construct a directed
graph with $2^{2k}$ nodes labeled $(E; x_1, \ldots, x_{2k-1})$ and $(O; x_1, \ldots, x_{2k-1})$, where each x is
either 0 or 1. Let there be $1 + f(x_1, x_2, \ldots, x_{2k})$ directed arcs from $(E; x_1, \ldots, x_{2k-1})$ to
$(O; x_2, \ldots, x_{2k})$, and $1 - f(x_1, x_2, \ldots, x_{2k})$ directed arcs leading from $(O; x_1, \ldots, x_{2k-1})$
to $(E; x_2, \ldots, x_{2k})$. We find that each node has the same number of arcs leading into it
as there are leading out; for example, $(E; x_1, \ldots, x_{2k-1})$ has $1 - f(0, x_1, \ldots, x_{2k-1}) +
1 - f(1, x_1, \ldots, x_{2k-1})$ leading in and $1 + f(x_1, \ldots, x_{2k-1}, 0) + 1 + f(x_1, \ldots, x_{2k-1}, 1)$
leading out, and $f(x, x_1, \ldots, x_{2k-1}) = -f(x_1, \ldots, x_{2k-1}, x)$. Drop all nodes that have
no paths leading either in or out, i.e., $(E; x_1, \ldots, x_{2k-1})$ if $f(0, x_1, \ldots, x_{2k-1}) = +1$, or
$(O; x_1, \ldots, x_{2k-1})$ if $f(1, x_1, \ldots, x_{2k-1}) = -1$. The resulting directed graph is seen to
be connected, since we can get from any node to $(E; 1, 0, 1, 0, \ldots, 1)$ and from this to any
desired node. By Theorem 2.3.4.2G, there is a cyclic path traversing each arc; this path
has length $2^{2k+1}$, and we may assume that it starts at node $(E; 0, \ldots, 0)$. Construct
a cyclic sequence with $X_1 = \cdots = X_{2k-1} = 0$, and $X_{n+2k-1} = x_{2k}$ if the nth arc
of the path is from $(E; x_1, \ldots, x_{2k-1})$ to $(O; x_2, \ldots, x_{2k})$ or from $(O; x_1, \ldots, x_{2k-1})$ to
$(E; x_2, \ldots, x_{2k})$. For example, the graph for k = 2 is shown in Fig. A-5; the arcs of
the cyclic path are numbered from 1 to 32, and the cyclic sequence is

(00001000110010101001101110111110)(00001...).

Fig. A-5. Directed graph for the construction in exercise 30.

Note that $\Pr(X_{2n} = 0) = \frac{11}{16}$ in this sequence. The sequence is clearly (2k)-distributed,
since each (2k)-tuple $x_1x_2\ldots x_{2k}$ occurs $1 + f(x_1, \ldots, x_{2k}) + 1 - f(x_1, \ldots, x_{2k}) = 2$
times in the cycle. The fact that $\Pr(X_{2n} = 0)$ has the desired value comes from the fact
that the maximum value on the right-hand side in the proof of the preceding exercise
has been achieved by this construction.
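
The two claims about the k = 2 example can be checked mechanically. The sketch below is not part of the original answer. It assumes the 32-term period is 00001000110010101001101110111110, with $X_1$ as its first term, and the helper names are illustrative only.

```python
from collections import Counter

CYCLE = "00001000110010101001101110111110"  # one period of the k = 2 example

def tuple_counts(cycle, t):
    """Count each t-tuple read cyclically around one period."""
    n = len(cycle)
    wrapped = cycle + cycle[:t - 1]
    return Counter(wrapped[i:i + t] for i in range(n))

def even_zero_count(cycle):
    """Return (number of zeros among X_2, X_4, ..., number of even positions)."""
    evens = cycle[1::2]          # cycle[0] is X_1, so cycle[1] is X_2
    return evens.count("0"), len(evens)

counts = tuple_counts(CYCLE, 4)
print(sorted(counts.items()))
print(even_zero_count(CYCLE))
```

Because the period has even length, the even positions of later periods line up with those of the first, so one period suffices for both checks.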

31. Use Algorithm W with rule $\mathcal{R}_1$ selecting the entire sequence. [For a generalization
of this type of nonrandom behavior in R5-sequences, see Jean Ville, Étude Critique
de la Notion de Collectif (Paris, 1939), 55-62. Perhaps R6 is also too weak, from this
standpoint.]

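32. If $\mathcal{R}$, $\mathcal{R}'$ are computable subsequence rules, so is $\mathcal{R}'' = \mathcal{R}\mathcal{R}'$ defined by the following
functions: $f''_n(x_0, \ldots, x_{n-1}) = 1$ iff $\mathcal{R}$ defines the subsequence $x_{r_1}, \ldots, x_{r_k}$ of $x_0, \ldots,
x_{n-1}$, where $k \ge 0$ and $0 \le r_1 < \cdots < r_k < n$ and $f'_k(x_{r_1}, \ldots, x_{r_k}) = 1$.

Now $\langle X_n\rangle\mathcal{R}\mathcal{R}'$ is $(\langle X_n\rangle\mathcal{R})\mathcal{R}'$. The result follows immediately.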

33. Given $\epsilon > 0$, find $N_0$ such that $N > N_0$ implies that both $|\nu_r(N)/N - p| < \epsilon$
and $|\nu_s(N)/N - p| < \epsilon$. Then find $N_1$ such that $N > N_1$ implies that $t_N$ is $r_M$ or $s_M$
for some $M > N_0$. Now $N > N_1$ implies that

$$\left|\frac{\nu_t(N)}{N} - p\right| = \left|\frac{\nu_r(N_r) + \nu_s(N_s)}{N} - p\right|
 = \left|\frac{\nu_r(N_r) - pN_r + \nu_s(N_s) - pN_s}{N_r + N_s}\right| < \epsilon.$$

34. For example, if the binary representation of t is $(1\,0^{b-2}\,1\,0^{a_1}\,1\,1\,0^{a_2}\,1 \ldots 1\,0^{a_k})_2$,
where "$0^a$" stands for a sequence of a consecutive zeros, let the rule $\mathcal{R}_t$ accept $U_n$ if
and only if $\lfloor bU_{n-k}\rfloor = a_1, \ldots, \lfloor bU_{n-1}\rfloor = a_k$.

35. Let do = 5o and a m +i — max{ s k \ 0 < k < 2 ttm }. Construct a subsequence 
rule that selects element X n if and only if n = Sk for some k < 2 am , when n is in the 
range a m < n < a m +i- Then lim m _oo v{a m )/am = J. 
36. Let b and k be arbitrary but fixed integers greater than 1. Let $Y_n = \lfloor bU_n\rfloor$. An
arbitrary infinite subsequence $\langle Z_n\rangle = \langle Y_{s_n}\rangle\mathcal{R}$ determined by algorithms S and R (as
in the proof of Theorem M) corresponds in a straightforward but notationally hopeless
manner to algorithms S′ and R′ that inspect $X_t, X_{t+1}, \ldots, X_{t+s}$ and/or select $X_t$,
$X_{t+1}, \ldots, X_{t+\min(k-1,s)}$ of $\langle X_n\rangle$ if and only if S and R inspect and/or select $Y_s$, where
$U_s = (0.X_tX_{t+1}\ldots X_{t+s})_2$. Algorithms S′ and R′ determine an infinite 1-distributed
subsequence of $\langle X_n\rangle$ and in fact (as in exercise 32) this subsequence is ∞-distributed
so it is (k, 1)-distributed. Hence we find that $\underline{\Pr}(Z_n = a)$ and $\overline{\Pr}(Z_n = a)$ differ from
1/b by less than $1/2^k$.

[The result of this exercise is true if “R6” is replaced consistently by “R4” or 
“R5”; but it is false if “Rl” is used, since Xp-j might be identically zero.] 

37. For $n \ge 2$ replace $U_{n^2}$ by $\frac12(U_{n^2} + \delta_n)$, where $\delta_n = 0$ or 1 according as the
set $\{U_{(n-1)^2+1}, \ldots, U_{n^2-1}\}$ contains an even or odd number of elements less than $\frac12$.
[Advances in Math. 14 (1974), 333-334.]

39. See Acta Arithmetica 21 (1972), 45-50. The best possible value of c is unknown. 

40. If every one-digit change to a random table yields a random table, all tables are 
random (or none are). If we don’t allow degrees of randomness, the answer must 
therefore be, “Not always.” 


SECTION 3.6



1.  RANDI  STJ   9F           Store exit location.
           STA   8F           Store value of k.
           LDA   XRAND        rA ← X.
           MUL   7F           rAX ← aX.
           INCX  1009         rX ← (aX + c) mod m.
           JOV   *+1          Ensure that overflow is off.
           SLAX  5            rA ← (aX + c) mod m.
           STA   XRAND        Store X.
           MUL   8F           rA ← ⌊kX/m⌋.
           INCA  1            Add 1, so that 1 ≤ Y ≤ k.
    9H     JMP   *            Return.
    XRAND  CON   1            Value of X; X_0 = 1.
    8H     CON   0            Temp storage of k.
    7H     CON   3141592621   The multiplier a.  ∎
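
For readers who do not follow MIX, the arithmetic of this subroutine can be sketched in Python. This is an illustration, not a translation of the listing. It assumes a decimal MIX with word size $m = 10^{10}$ (the ten-digit multiplier would not fit in a binary MIX word), and the class name `Randi` is hypothetical.

```python
A = 3141592621   # the multiplier a from the listing
C = 1009         # the increment c added by INCX
M = 10**10       # ASSUMPTION: word size of a decimal MIX (five bytes of size 100)

class Randi:
    """Linear congruential generator returning integers Y with 1 <= Y <= k."""

    def __init__(self, seed=1):
        self.x = seed                      # XRAND; the listing starts with X_0 = 1

    def next(self, k):
        self.x = (A * self.x + C) % M      # X <- (aX + c) mod m
        return (k * self.x) // M + 1       # Y <- floor(kX/m) + 1

gen = Randi()
print([gen.next(6) for _ in range(10)])
```

Taking ⌊kX/m⌋ uses the high-order part of X, which is why the listing multiplies by k rather than reducing X modulo k.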


2. Putting a random number generator into a program makes the results essentially 
unpredictable to the programmer. If the behavior of the machine on each problem were 
known in advance, few programs would ever be written. As Turing has said, the actions 
of a computer quite often do surprise its programmer, especially when a program is 
being debugged. 

So the world had better watch out. 

7. In fact, you only need the 2-bit values $\lfloor X_n/2^{16}\rfloor \bmod 4$; see D. E. Knuth,
“Deciphering a linear congruential encryption,” to appear. See also J. Reeds, Cryptologia
1 (1977), 20-26, 3 (1979), 83-95, for solutions to related problems.
SECTION 4.1

1. 1010, 1011, 1000, 11000, 11001, 11110. 
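
The six strings above can be reproduced by repeated division by −2. The following sketch is not part of the original answer. It assumes the strings are the negabinary forms of −10, −9, −8, 8, 9, and 10, and the function name is illustrative.

```python
def to_negabinary(n):
    """Return the base -2 digit string of the integer n."""
    if n == 0:
        return "0"
    digits = []
    while n != 0:
        n, r = divmod(n, -2)
        if r < 0:              # Python's remainder takes the sign of the divisor
            n, r = n + 1, r + 2
        digits.append(str(r))
    return "".join(reversed(digits))

for value in (-10, -9, -8, 8, 9, 10):
    print(value, to_negabinary(value))
```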

2. (a) −110001, −11.001001001001..., 11.0010010000111111... .

(b) 11010011, 1101.001011001011..., 111.011001000100000... .

(c) $\bar1 1 1 \bar1\bar1$, $\bar1 0.0\bar1\bar1 0 1 1 0\bar1\bar1 0 1 1\ldots$, $1 0.0 1 1\bar1 1 1 1\bar1\ldots$ .

(d) −9.4, −...7582417582413, ...562951413.

3. $(1010113.2)_{2i}$.

4. (a) Between rA and rX. (b) The remainder in rX has radix point between bytes 3 
and 4; the quotient in rA has radix point one byte to the right of the least significant 
portion of the register. 

5. It has been subtracted from $999\ldots9 = 10^p - 1$, instead of from $1000\ldots0 = 10^p$.

6. (a,c) $2^{p-1} - 1$, $-(2^{p-1} - 1)$; (b) $2^{p-1} - 1$, $-2^{p-1}$.

7. A ten’s complement representation for a negative number x can be obtained by
considering $10^n + x$ (where n is large enough for this to be positive) and extending it
on the left with infinitely many nines. The nines’ complement representation can be
obtained in the usual manner. (These two representations are equal for nonterminating
decimals, otherwise the nines’ complement representation has the form ...(a)99999...
while the ten’s complement representation has the form ...(a + 1)0000... .) The
representations may be considered sensible if we regard the value of the infinite sum
$N = 9 + 90 + 900 + 9000 + \cdots$ as −1, since $N - 10N = 9$.

See also exercise 31, which considers p-adic number systems. The latter agree
with the p’s complement notations considered here, for numbers whose radix-p
representation is terminating, but there is no simple relation between the field of p-adic
numbers and the field of real numbers.

8. $\sum_j a_jb^j = \sum_j (a_{kj+k-1}b^{k-1} + \cdots + a_{kj})\,b^{kj}$.

9. A BAD ADOBE FACADE FADED. [Note: Other possible “number sentences” would be
DO A DEED A DECADE; A CAD FED A BABE BEEF, COCOA, COFFEE; BOB FACED A DEAD DODO.]

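10. $$\begin{bmatrix} \ldots, a_3, a_2, a_1, a_0;\ a_{-1}, a_{-2}, \ldots\\ \ldots, b_3, b_2, b_1, b_0;\ b_{-1}, b_{-2}, \ldots \end{bmatrix}
= \begin{bmatrix} \ldots, A_3, A_2, A_1, A_0;\ A_{-1}, A_{-2}, \ldots\\ \ldots, B_3, B_2, B_1, B_0;\ B_{-1}, B_{-2}, \ldots \end{bmatrix},$$

$$A_j = \begin{bmatrix} a_{k_{j+1}-1}, a_{k_{j+1}-2}, \ldots, a_{k_j}\\ b_{k_{j+1}-2}, \ldots, b_{k_j} \end{bmatrix},
\qquad B_j = b_{k_{j+1}-1}\ldots b_{k_j},$$

where $\langle k_n\rangle$ is any infinite sequence of integers with $k_{j+1} > k_j$.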

11. (The following algorithm works both for addition or subtraction, depending on
whether the plus or minus sign is chosen.)

Start by setting $k \leftarrow a_{n+1} \leftarrow a_{n+2} \leftarrow b_{n+1} \leftarrow b_{n+2} \leftarrow 0$; then for m = 0, 1,
..., n + 2 do the following: Set $c_m \leftarrow a_m \pm b_m + k$; then if $c_m \ge 2$, set $k \leftarrow -1$ and
$c_m \leftarrow c_m - 2$; otherwise if $c_m < 0$, set $k \leftarrow 1$ and $c_m \leftarrow c_m + 2$; otherwise (i.e., if
$0 \le c_m \le 1$), set $k \leftarrow 0$.
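
The carry rule above can be sketched as follows. This is an illustration, not part of the original answer. It assumes the operands are negabinary digit lists stored least significant digit first, and the helper names are hypothetical.

```python
def negabinary_digits(n):
    """Digits of n in base -2, least significant first."""
    digits = []
    while n != 0:
        n, r = divmod(n, -2)
        if r < 0:
            n, r = n + 1, r + 2
        digits.append(r)
    return digits or [0]

def negabinary_value(digits):
    return sum(d * (-2) ** i for i, d in enumerate(digits))

def negabinary_add(a, b, sign=+1):
    """Return the digits c_0..c_{n+2} of a + sign*b, with carry k in {-1, 0, 1}."""
    n = max(len(a), len(b)) - 1
    a = a + [0] * (n + 3 - len(a))       # a_{n+1} = a_{n+2} = 0
    b = b + [0] * (n + 3 - len(b))       # b_{n+1} = b_{n+2} = 0
    k, c = 0, []
    for m in range(n + 3):
        cm = a[m] + sign * b[m] + k
        if cm >= 2:
            k, cm = -1, cm - 2           # 2(-2)^m equals -(-2)^(m+1)
        elif cm < 0:
            k, cm = 1, cm + 2
        else:
            k = 0
        c.append(cm)
    return c

total = negabinary_add(negabinary_digits(9), negabinary_digits(-10))
print(total, negabinary_value(total))
```

Passing `sign=-1` selects subtraction, matching the choice of sign in the answer.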

12. (a) Subtract $\pm(\ldots a_30a_10)_{-2}$ from $\pm(\ldots a_40a_20a_0)_{-2}$ in the negabinary system.
(See also exercise 7.1-18 for a trickier solution that uses full-word logical operations.)
(b) Subtract $(\ldots b_30b_10)_2$ from $(\ldots b_40b_20b_0)_2$ in the binary system.
Fig. A-6. Fundamental region for quater-imaginary numbers.


13. $(1.909090\ldots)_{-10} = (0.090909\ldots)_{-10} = \frac1{11}$.

14.         1 1 3 2 1     [5 − 4i]
            1 1 3 2 1     [5 − 4i]
            ---------
            1 1 3 2 1
          1 1 2 0 2
        1 2 1 2 3
      1 1 3 2 1
    1 1 3 2 1
    -----------------
    0 1 0 3 1 1 2 0 1     [9 − 40i]
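
The bracketed values in this multiplication can be verified by evaluating the digit strings in base 2i. The sketch below is not part of the original answer. It uses exact rational arithmetic, and the function name is illustrative.

```python
from fractions import Fraction

def quater_imaginary(s):
    """Value of a base-2i digit string as a pair (real part, imaginary part)."""
    whole, _, frac = s.partition(".")
    re, im = Fraction(0), Fraction(0)
    for ch in whole:                      # Horner's rule: v <- v*(2i) + digit
        re, im = -2 * im + int(ch), 2 * re
    fre, fim = Fraction(0), Fraction(0)
    for ch in reversed(frac):             # v <- (v + digit)/(2i)
        fre = fre + int(ch)
        fre, fim = fim / 2, -fre / 2
    return re + fre, im + fim

for s in ("11321", "11202", "12123", "010311201", "1010113.2"):
    print(s, quater_imaginary(s))
```

The strings 11202 and 12123 are the partial products for the multiplier digits 2 and 3, and 1010113.2 is the string from answer 3.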

15. $[-\frac{10}{11}, \frac1{11}]$, and the rectangle shown in Fig. A-6.

16. It is tempting to try to do this in a very simple way, by using the rule $2 = (1100)_{i-1}$
to take care of carries; but that leads to a nonterminating method if, for example, we
try to add 1 to $(11101)_{i-1} = -1$.

The following solution does the job by providing four related algorithms (namely
for adding or subtracting 1 or i). If α is a string of zeros and ones, let $\alpha^P$ be a string
of zeros and ones such that $(\alpha^P)_{i-1} = (\alpha)_{i-1} + 1$; and let $\alpha^{-P}$, $\alpha^Q$, $\alpha^{-Q}$ be defined
similarly, with −1, +i, and −i respectively in place of +1. Then

$(\alpha0)^P = \alpha1$;  $(\alpha x1)^P = \alpha^Qx0$.  $(\alpha0)^Q = \alpha^P1$;  $(\alpha1)^Q = \alpha^{-Q}0$.

$(\alpha x0)^{-P} = \alpha^{-Q}x1$;  $(\alpha1)^{-P} = \alpha0$.  $(\alpha0)^{-Q} = \alpha^Q1$;  $(\alpha1)^{-Q} = \alpha^{-P}0$.

Here x stands for either 0 or 1, and the strings are extended on the left with zeros
if necessary. The processes will clearly always terminate. Hence every number of the
form a + bi with a and b integers is representable in the i − 1 system.
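
The four rules can be written as mutually recursive functions on bit strings. This is a minimal sketch rather than the book's own code. It treats the empty string as zero and pads on the left with a zero where a rule needs two trailing digits. The function names are hypothetical.

```python
def value(s):
    """Gaussian integer (re, im) denoted by the bit string s in base i - 1."""
    re, im = 0, 0
    for ch in s:
        re, im = -re - im + int(ch), re - im   # multiply by (i - 1), add digit
    return re, im

def plus_one(s):                # alpha^P
    s = s or "0"
    if s[-1] == "0":
        return s[:-1] + "1"
    s = s.rjust(2, "0")         # extend on the left with a zero if needed
    return plus_i(s[:-2]) + s[-2] + "0"

def plus_i(s):                  # alpha^Q
    s = s or "0"
    if s[-1] == "0":
        return plus_one(s[:-1]) + "1"
    return minus_i(s[:-1]) + "0"

def minus_one(s):               # alpha^{-P}
    s = s or "0"
    if s[-1] == "1":
        return s[:-1] + "0"
    s = s.rjust(2, "0")
    return minus_i(s[:-2]) + s[-2] + "1"

def minus_i(s):                 # alpha^{-Q}
    s = s or "0"
    if s[-1] == "0":
        return plus_i(s[:-1]) + "1"
    return minus_one(s[:-1]) + "0"

print(plus_one("1"), minus_one("0"), plus_i("0"), minus_i("0"))
```

Every recursive call passes a strictly shorter string, and on the empty string the chain `minus_one`, `minus_i`, `plus_i`, `plus_one` ends at `plus_one`, so the recursion terminates.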

17. No (in spite of exercise 28); the number −1 cannot be so represented. This can
be proved by constructing a set S as in Fig. 1. We do have the representations
$-i = (0.1111\ldots)_{1+i}$, $i = (100.1111\ldots)_{1+i}$.

18. Let $S_0$ be the set of points $(a_7a_6a_5a_4a_3a_2a_1a_0)_{i-1}$, where each $a_k$ is 0 or 1. (Thus,
$S_0$ is given by the 256 interior dots shown in Fig. 1, if that picture is multiplied
by 16.) We first show that S is closed: If $y_1, y_2, \ldots$ is an infinite subset of S, we
have $y_n = \sum_{k\ge1} a_{nk}16^{-k}$, where each $a_{nk}$ is in $S_0$. Construct a tree whose nodes are
$(a_{n1}, \ldots, a_{nr})$, for $1 \le r \le n$, and let a node of this tree be an ancestor of another
node if it is an initial subsequence of that node. By the infinity lemma this tree has
an infinite path $(a_1, a_2, a_3, \ldots)$, and it follows that $\sum_{k\ge1} a_k16^{-k}$ is a limit point of
$\{y_1, y_2, \ldots\}$ in S.

By the answer to exercise 16, all numbers of the form $(a + bi)/16^k$ are representable,
when a and b are integers. Therefore if x and y are arbitrary reals and $k \ge 1$, the
number $z_k = (\lfloor 16^kx\rfloor + \lfloor 16^ky\rfloor i)/16^k$ is in S + m + ni for some integers m and n. It
can be shown that S + m + ni is bounded away from the origin when $(m, n) \ne (0, 0)$.
Consequently if |x| and |y| are fixed and k is sufficiently large, we have $z_k \in S$, and
$\lim_{k\to\infty} z_k = x + yi$ is in S.

[B. Mandelbrot calls S the “twindragon,” since he noticed that it is essentially 
obtained by joining two “dragon curves” belly-to-belly; see his book Fractals: Form,
Chance, and Dimension (San Francisco: Freeman, 1977), 313-314. Other properties of
the dragon curve are described in C. Davis and D. E. Knuth, J. Recr. Math. 3 (1970), 
66-81, 133-149.] 


19. If m > u or m < l, find $a \in D$ such that $m \equiv a$ (modulo b); the desired
representation will be a representation of $m' = (m - a)/b$ followed by a. Note that
m > u implies l < m′ < m; m < l implies m < m′ < u; so the algorithm terminates.

[There are no solutions when b = 2. The representation will be unique iff $0 \in D$;
nonunique representation occurs for example when D = {−3, −1, 7}, b = 3, since
$(\alpha)_3 = (\bar3 7 7 \bar3\,\alpha)_3$. When $b \ge 3$ it is not difficult to show that there are exactly $2^{b-3}$
solution sets D in which |a| < b for all $a \in D$. Furthermore the set $D = \{0, 1, 2 - \epsilon_2b^n,
3 - \epsilon_3b^n, \ldots, b - 2 - \epsilon_{b-2}b^n, b - 1 - b^n\}$ gives unique representations, for all $b \ge 3$
and $n \ge 1$, when each $\epsilon_j$ is 0 or 1. Reference: Proc. IEEE Symp. Comp. Arith. 4
(1978), 1-9.]


20. (a) $0.\bar1\bar1\bar1\ldots = \bar1.888\ldots = \bar18.{}^{111}_{777}\ldots = \bar18\,{}^{1}_{7}.{}^{222}_{666}\ldots = \cdots = \bar18\,{}^{123456}_{765432}.{}^{777}_{111}\ldots$
has nine representations. (b) A “D-fraction” $.a_1a_2\ldots$ always lies between −1/9 and
+71/9. Suppose x has ten or more D-decimal representations. Then for sufficiently
large k, $10^kx$ has ten representations that differ to the left of the decimal point: $10^kx =
n_1 + f_1 = \cdots = n_{10} + f_{10}$ where each $f_j$ is a D-fraction. By uniqueness of integer
representations, the $n_j$ are distinct, say $n_1 < \cdots < n_{10}$, hence $n_{10} - n_1 \ge 9$; but this
implies $f_1 - f_{10} \ge 9 > 71/9 - (-1/9)$, a contradiction. (c) Any number of the form
$0.a_1a_2\ldots$, where each $a_j$ is −1 or 8, equals $\bar1.a'_1a'_2\ldots$ where $a'_j = a_j + 9$ (and it even
has six more representations $\bar18.a''_1a''_2\ldots$, etc.).


21. We can convert to such a representation by using a method like that suggested in
the text for converting to balanced ternary.

In contrast to the systems of exercise 20, zero can be represented in infinitely
many ways, all obtained from $\frac12 + \sum_{k\ge1}(-4\frac12)\cdot10^{-k}$ (or from the negative of this
representation) by multiplying it by a power of ten. The representations of unity are
$1\frac12 - \frac12$, $\frac12 + \frac12$, $5 - 3\frac12 - \frac12$, $5 - 4\frac12 + \frac12$, $50 - 45 - 3\frac12 - \frac12$, $50 - 45 - 4\frac12 + \frac12$, etc.,
where $\pm\frac12 = (\pm4\frac12)(10^{-1} + 10^{-2} + \cdots)$. [AMM 57 (1950), 90-93.]

22 . Given some approximation b n . . .6160 with error X!o<fc<n — x > 10 * for 

6 > 0, we will show how to reduce the error by approximately 10 —t . (The process can 
be started by finding a suitable XlocfcCn bkl0 k > x\ then a finite number of reductions 
of this type will make the error less than e.) Simply choose m > n so large that the 
decimal representation of — 10 m a has a one in position 10 — * and no ones in positions 
IQ—t-f- 1 , iq-^ 2 , ... f 10 n . Then 10 ^ 0 ; + (a suitable sum of powers of 10 between 10 m 
and 10 ") + £„<*<„ « Eo<k. W 0 k - IQ- 1 . 
23. The set $S = \{\sum_{k\ge1} a_kb^{-k} \mid a_k \in D\}$ is closed as in exercise 18, hence it is
measurable, and in fact it has positive measure. Since $bS = \bigcup_{a\in D}(a + S)$, we have
$b\mu(S) = \mu(bS) \le \sum_{a\in D}\mu(a + S) = b\mu(S)$, and we must therefore have
$\mu((a + S) \cap (a' + S)) = 0$ when $a \ne a' \in D$. Now T has measure zero since it is a
union of countably many sets of the form $10^k(n + ((a + S) \cap (a' + S)))$, $a \ne a'$, each
of measure zero.

[The set T cannot be empty, since the real numbers cannot be written as a 
countable union of disjoint, closed, bounded sets; cf. AMM 84 (1977), 827-828. If 
D has less than b elements, the set of numbers representable with radix b and digits 
from D has measure zero. If D has more than b elements and represents all reals, T 
has infinite measure.] 

24. $\{2a\cdot10^k + a' \mid 0 \le a < 5, 0 \le a' < 2\}$ or $\{5a'\cdot10^k + a \mid 0 \le a < 5, 0 \le a' < 2\}$,
for $k \ge 0$. [R. L. Graham has shown that there are no more sets of integer digits with
these properties. And Andrew Odlyzko has shown that the restriction to integers is
superfluous, in the sense that if the smallest two elements of D are 0 and 1, all the digits
must be integers. Proof: Let $S = \{\sum_{k<0} a_kb^k \mid a_k \in D\}$ be the set of “fractions,”
and let $X = \{(a_n\ldots a_0)_b \mid a_k \in D\}$ be the set of “whole numbers”; then $[0, \infty) =
\bigcup_{x\in X}(x + S)$, and $(x + S) \cap (x' + S)$ has measure zero for $x \ne x' \in X$. We have
$(0, 1) \subseteq S$, and by induction on m we will prove that $(m, m + 1) \subseteq x_m + S$ for some
$x_m \in X$. Let $x_m \in X$ be such that $(m, m + \epsilon) \cap (x_m + S)$ has positive measure for all
$\epsilon > 0$. Then $x_m \le m$, and $x_m$ must be an integer lest $x_{\lfloor x_m\rfloor} + S$ overlap $x_m + S$ too
much. If $x_m > 0$, the fact that $(m - x_m, m - x_m + 1) \cap S$ has positive measure implies
by induction that this measure is 1, and $(m, m + 1) \subseteq x_m + S$ since S is closed. If
$x_m = 0$ and $(m, m + 1) \not\subseteq S$, we must have $m < x'_m < m + 1$ for some $x'_m \in X$,
where $(m, x'_m) \subseteq S$; but then 1 + S overlaps $x'_m + S$. See Proc. London Math. Soc.
(3) 18 (1978), 581-595.]

Note: If we drop the restriction $0 \in D$, there are many other cases, some of which
are quite interesting, especially {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, {1, 2, 3, 4, 5, 51, 52, 53, 54, 55},
and {2, 3, 4, 5, 6, 52, 53, 54, 55, 56}. Alternatively if we allow negative digits we obtain
many other solutions by the method of exercise 19, plus further sets of unusual digits
like {−1, 0, 1, 2, 3, 4, 5, 6, 7, 18} that don’t meet the conditions stated there. It appears
hopeless to find a nice characterization of all solutions with negative digits.

25. A positive number whose base b representation has m consecutive (b − 1)’s to
the right of the decimal point must have the form $c/b^n + (b^m - \theta)/b^{n+m}$, where c
and n are nonnegative integers and $0 < \theta \le 1$. So if u/v has this form, we find
that $b^{m+n}u = b^mcv + b^mv - \theta v$. Therefore $\theta v$ is an integer that is a multiple of $b^m$.
But $0 < \theta v \le v < b^m$. [There can be arbitrarily long runs of other digits aaaaa, if
$0 \le a < b - 1$, for example in the representation of a/(b − 1).]

26. The proof of “sufficiency” is a straightforward generalization of the usual proof for
base b, by successively constructing the desired representation. The proof of “necessity”
breaks into two parts: If $\beta_{n+1}$ is greater than $\sum_{k\le n} c_k\beta_k$ for some n, then $\beta_{n+1} - \epsilon$
has no representation for small ε. If $\beta_{n+1} \le \sum_{k\le n} c_k\beta_k$ for all n, but equality does
not always hold, we can show that there are two representations for certain x. [See
Transactions of the Royal Society of Canada, series III, 46 (1952), 45-55.]

27. Proof by induction on |n|: If n is even we must take $e_0 > 0$, and the result follows
by induction, since n/2 has a unique such representation. If n is odd, we must take
$e_0 = 0$, and the problem reduces to representing −(n − 1)/2; if the latter quantity is
either zero or one, there is obviously only one way to proceed, otherwise it has a unique
reversing representation by induction.

[It follows that every positive integer has exactly two such representations with
decreasing exponents $e_0 > e_1 > \cdots > e_t$: one with t even and the other with t odd.]

28. A proof like that of exercise 27 may be given. Note that a + bi is a multiple of
1 + i by a complex integer if and only if a + b is even. This representation is intimately
related to the dragon curve discussed in the answer to exercise 18.

29. It suffices to prove that any collection $\{T_0, T_1, T_2, \ldots\}$ satisfying Property B may
be obtained by collapsing some collection $\{S_0, S_1, S_2, \ldots\}$, where $S_0 = \{0, 1, \ldots, b - 1\}$
and all elements of $S_1, S_2, \ldots$ are multiples of b.

To prove the latter statement, we may assume that $1 \in T_0$ and that there is a least
element b > 1 such that $b \notin T_0$. We will prove, by induction on n, that if $nb \notin T_0$,
then nb + 1, nb + 2, ..., nb + b − 1 are not in any of the $T_j$’s; but if $nb \in T_0$, then
so are nb + 1, ..., nb + b − 1. The result then follows with $S_1 = \{nb \mid nb \in T_0\}$,
$S_2 = T_1$, $S_3 = T_2$, etc.

If $nb \notin T_0$, then $nb = t_0 + t_1 + \cdots$, where $t_1, t_2, \ldots$ are multiples of b; hence
$t_0 < nb$ is a multiple of b. By induction, $(t_0 + k) + t_1 + t_2 + \cdots$ is the representation
of nb + k, for 0 < k < b; hence $nb + k \notin T_j$ for any j.

If $nb \in T_0$ and 0 < k < b, let the representation of nb + k be $t_0 + t_1 + \cdots$.
We cannot have $t_j = nb + k$ for $j \ge 1$, lest nb + b have two representations
$(b - k) + \cdots + (nb + k) + \cdots = (nb) + \cdots + b + \cdots$. By induction, $t_0 \bmod b = k$; and the
representation $nb = (t_0 - k) + t_1 + \cdots$ implies that $t_0 = nb + k$.

[Reference: Nieuw Archief voor Wiskunde (3) 4 (1956), 15-17. A finite analog of 
this result was derived by P. A. MacMahon, Combinatory Analysis 1 (1915), 217-223.] 

30. (a) Let $A_j$ be the set of numbers n whose representation does not involve $b_j$; then
by the uniqueness property, $n \in A_j$ iff $n + b_j \notin A_j$. Consequently we have $n \in A_j$ iff
$n + 2b_j \in A_j$. It follows that, for $j \ne k$, $n \in A_j \cap A_k$ iff $n + 2b_jb_k \in A_j \cap A_k$. Let m
be the number of integers $n \in A_j \cap A_k$ such that $0 \le n < 2b_jb_k$. Then this interval
contains exactly m integers that are in $A_j$ but not $A_k$, exactly m in $A_k$ but not $A_j$, and
exactly m in neither $A_j$ nor $A_k$; hence $4m = 2b_jb_k$. Therefore $b_j$ and $b_k$ cannot both
be odd. But at least one $b_j$ is odd, of course, since odd numbers can be represented.

(b) According to (a) we can renumber the b’s so that $b_0$ is odd and $b_1, b_2, \ldots$ are even;
then $\frac12b_1, \frac12b_2, \ldots$ must also be a binary basis, and the process can be iterated.

(c) If it is a binary basis, we must have positive and negative $d_k$’s for arbitrarily
large k, in order to represent $\pm2^n$ when n is large. Conversely, the following algorithm
may be used:

S1. [Initialize.] Set $k \leftarrow 0$.

S2. [Done?] If n = 0, terminate.

S3. [Choose.] If n is odd, include $2^kd_k$ in the representation, and set $n \leftarrow (n - d_k)/2$.
Otherwise set $n \leftarrow n/2$.

S4. [Advance k.] Increase k by 1 and return to S2. ∎

At each step the choice is forced; furthermore step S3 always decreases |n| unless
$n = -d_k$, hence the algorithm must terminate.

(d) Two iterations of steps S2-S4 in the preceding algorithm will change $4m \to m$,
$4m + 1 \to m + 5$, $4m + 2 \to m + 7$, $4m + 3 \to m - 1$. Arguing as in exercise 19, we
need only show that the algorithm terminates for $-2 \le n \le 8$; all other values of n
are moved toward this interval. In this range 3 → −1 → −2 → 6 → 8 → 2 → 7 → 0
and 4 → 1 → 5 → 6. Thus $1 = 7\cdot2^0 - 13\cdot2^1 + 7\cdot2^2 - 13\cdot2^3 - 13\cdot2^5 - 13\cdot2^9 + 7\cdot2^{10}$.

Note: The choice $d_0, d_1, d_2, \ldots$ = 5, −3, 3, 5, −3, 3, ... also yields a binary
basis. For further details see Math. Comp. 18 (1964), 537-546; A. D. Sands, Acta
Mathematica, Acad. Sci. Hung., 8 (1957), 65-86.
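
Steps S1-S4 translate directly into code. The sketch below is not part of the original answer. It assumes the digit sequence of part (d) is $d_k = 7$ for even k and $d_k = -13$ for odd k, as the expansion of 1 suggests. The function names and the `max_steps` guard are hypothetical additions.

```python
def represent(n, d, max_steps=10_000):
    """Steps S1-S4: return the exponents k for which 2^k * d(k) is included."""
    chosen = []
    k = 0                                    # S1
    while n != 0:                            # S2
        if k >= max_steps:
            raise RuntimeError("no termination within max_steps")
        if n % 2 == 1:                       # S3: n is odd (also for negative n)
            chosen.append(k)
            n = (n - d(k)) // 2
        else:
            n //= 2
        k += 1                               # S4
    return chosen

def d_alternating(k):
    """ASSUMED digit sequence for part (d): 7, -13, 7, -13, ..."""
    return 7 if k % 2 == 0 else -13

exponents = represent(1, d_alternating)
print(exponents, sum(2**k * d_alternating(k) for k in exponents))
```

The guard only protects against digit sequences that are not binary bases; for the sequence above the argument in part (d) shows that the loop always ends.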

31. (See also the related exercises 3.2.2-11, 4.3.2-13, 4.6.2-22.) 

(a) By multiplying numerator and denominator by suitable powers of 2, we may
assume that $u = (\ldots u_2u_1u_0)_2$ and $v = (\ldots v_2v_1v_0)_2$ are 2-adic integers, where $v_0 = 1$.
The following computational method now determines w, using the notation $u^{(n)}$ to
stand for the integer $(u_{n-1}\ldots u_0)_2 = u \bmod 2^n$ when n > 0:

Let $w_0 = u_0$ and $w^{(1)} = w_0$. For n = 1, 2, ..., assume that we have found
an integer $w^{(n)} = (w_{n-1}\ldots w_0)_2$ such that $u^{(n)} \equiv v^{(n)}w^{(n)}$ (modulo $2^n$). Then we
have $u^{(n+1)} \equiv v^{(n+1)}w^{(n)}$ (modulo $2^n$), hence $w_n = 0$ or 1 according as the quantity
$(u^{(n+1)} - v^{(n+1)}w^{(n)}) \bmod 2^{n+1}$ is 0 or $2^n$.
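
The bit-by-bit method of part (a) can be sketched as follows. This is an illustration under the assumption that u and v are ordinary integers standing for 2-adic integers with v odd. The function name is hypothetical.

```python
def two_adic_quotient(u, v, bits):
    """Low-order `bits` bits of w = u/v as a 2-adic integer; v must be odd."""
    if v % 2 == 0:
        raise ValueError("v_0 must be 1")
    w = u % 2                                  # w^(1) = w_0 = u_0
    for n in range(1, bits):
        modulus = 2 ** (n + 1)
        diff = (u - v * w) % modulus           # either 0 or 2^n
        if diff:                               # then w_n = 1
            w += 2 ** n
    return w

w = two_adic_quotient(1, 3, 8)
print(w, bin(w), (3 * w) % 2**8)
```

Working with u and v in full rather than with $u^{(n+1)}$ and $v^{(n+1)}$ gives the same residues modulo $2^{n+1}$, so the chosen bits agree with the method in the text.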

(b) Find the smallest integer k such that $2^k \equiv 1$ (modulo 2n + 1). Then we have
$1/(2n + 1) = m/(2^k - 1)$ for some integer m, $1 \le m < 2^{k-1}$. Let α be the k-bit
binary representation of m; then $(0.\alpha\alpha\alpha\ldots)_2$ times 2n + 1 is $(0.111\ldots)_2 = 1$ in the
binary system, and $(\ldots\alpha\alpha\alpha)_2$ times 2n + 1 is $(\ldots111)_2 = -1$ in the 2-adic system.

(c) If u is rational, say $u = m/(2^en)$ where n is odd and positive, the 2-adic
representation of u is periodic, because the set of numbers with periodic expansions
includes −1/n and is closed under the operations of negation, division by 2, and
addition. Conversely, if $u_{N+\lambda} = u_N$ for all $N \ge \mu$, the 2-adic number $(2^\lambda - 1)2^{-\mu}u$
is an integer.

(d) The square of any number of the form $(\ldots u_2u_11)_2$ has the form $(\ldots001)_2$,
hence the condition is necessary. To show the sufficiency, we can use the following
procedure to compute $v = \sqrt n$ when n mod 8 = 1:

H1. [Initialize.] Set $m \leftarrow (n - 1)/8$, $k \leftarrow 2$, $v_0 \leftarrow 1$, $v_1 \leftarrow 0$, $v \leftarrow 1$. (During this
algorithm we will have $v = (v_{k-1}\ldots v_1v_0)_2$ and $v^2 = n - 2^{k+1}m$.)

H2. [Transform.] If m is even, set $v_k \leftarrow 0$, $m \leftarrow m/2$. Otherwise set $v_k \leftarrow 1$,
$m \leftarrow (m - v - 2^{k-1})/2$, $v \leftarrow v + 2^k$.

H3. [Advance k.] Increase k by 1 and return to H2. ∎
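
Steps H1-H3 run forever on a true 2-adic number, so a program has to stop after a chosen number of bits. The sketch below is not part of the original answer. It assumes n is an ordinary integer with n mod 8 = 1, and the function name and the `bits` cutoff are hypothetical.

```python
def two_adic_sqrt(n, bits):
    """Return v < 2^bits with v*v == n (mod 2^(bits+1)), following H1-H3."""
    if n % 8 != 1:
        raise ValueError("n mod 8 must be 1")
    m = (n - 1) // 8                 # H1: now v*v == n - 2^(k+1) * m with k = 2
    k, v = 2, 1
    while k < bits:
        if m % 2 == 0:               # H2, m even: v_k = 0
            m //= 2
        else:                        # H2, m odd: v_k = 1
            m = (m - v - 2 ** (k - 1)) // 2
            v += 2 ** k
        k += 1                       # H3
    return v

v = two_adic_sqrt(17, 10)
print(v, bin(v), (v * v - 17) % 2**11)
```

The invariant $v^2 = n - 2^{k+1}m$ holds exactly at every pass, which is why the result is correct modulo $2^{\mathrm{bits}+1}$ when the loop stops.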

32. A generalization appears in Math. Comp. 29 (1975), 84-86. 

33. Let $K_n$ be the set of all such n-digit numbers, so that $k_n = \|K_n\|$. If S and T
are any finite sets of integers, we shall say S ~ T if S = T + x for some integer x,
and we shall write $k_n(S) = \|\mathcal{K}_n(S)\|$, where $\mathcal{K}_n(S)$ is the family of all subsets of $K_n$
that are ~ S. When n = 0, we have $k_n(S) = 0$ unless $\|S\| \le 1$, since zero is the only
“0-digit” number. When $n \ge 1$ and $S = \{s_1, \ldots, s_r\}$, we have

$$\mathcal{K}_n(S) = \bigcup_{0\le j<b}\ \bigcup_{(a_1,\ldots,a_r)} \bigl\{\{t_1b + a_1, \ldots, t_rb + a_r\} \bigm|
\{t_1, \ldots, t_r\} \in \mathcal{K}_{n-1}(\{(s_i + j - a_i)/b \mid 1 \le i \le r\})\bigr\},$$

where the inner union is over all sequences of digits $(a_1, \ldots, a_r)$ satisfying the condition
$a_i \equiv s_i + j$ (modulo b) for $1 \le i \le r$. In this formula we require $t_i - t_{i'} =
(s_i - a_i)/b - (s_{i'} - a_{i'})/b$ for $1 \le i < i' \le r$, so that the naming of subscripts is
uniquely determined. By the principle of inclusion and exclusion, therefore, we have
$k_n(S) = \sum_{0\le j<b}\sum_{m\ge1} (-1)^{m-1}f(S, m, j)$, where f(S, m, j) is the number of sets
of integers that can be expressed as $\{t_1b + a_1, \ldots, t_rb + a_r\}$ in the above manner for
m different sequences $(a_1, \ldots, a_r)$, summed over all choices of m different sequences
$(a_1, \ldots, a_r)$. Given m different sequences $(a_1^{(l)}, \ldots, a_r^{(l)})$ for $1 \le l \le m$, the number
of such sets is $k_{n-1}(\{(s_i + j - a_i^{(l)})/b \mid 1 \le i \le r, 1 \le l \le m\})$. Thus there is a
collection of sets $\mathcal{T}(S)$ such that

$$k_n(S) = \sum_{T\in\mathcal{T}(S)} c_T\,k_{n-1}(T),$$

where each $c_T$ is an integer. Furthermore if $T \in \mathcal{T}(S)$, its elements are near those of S;
we have $\min T \ge (\min S - \max D)/b$ and $\max T \le (\max S + b - 1 - \min D)/b$. Thus
we obtain simultaneous recurrence relations for the sequences $\langle k_n(S)\rangle$, where S runs
through the nonempty integer subsets of [l, u + 1], in the notation of exercise 19. Since
$k_n = k_n(S)$ for any one-element set S, the sequence $\langle k_n\rangle$ appears in these recurrences.
The coefficients $c_T$ can be computed from the first few values of $k_n(S)$, so we can
obtain a system of equations defining the generating functions
$k_S(z) = \sum k_n(S)z^n = [\|S\| \le 1] + z\sum_{T\in\mathcal{T}(S)} c_Tk_T(z)$.

For example, when D = {−1, 0, 3} and b = 3 we have $l = -\frac32$ and $u = \frac12$, so the
relevant sets S are {0}, {0, 1}, {−1, 1}, and {−1, 0, 1}. The corresponding sequences
for $n \le 3$ are ⟨1, 3, 8, 21⟩, ⟨0, 1, 3, 8⟩, ⟨0, 0, 1, 4⟩, and ⟨0, 0, 0, 0⟩; so we obtain

$$k_0(z) = 1 + z(3k_0(z) - k_{01}(z)), \qquad k_{02}(z) = z(k_{01}(z) + k_{02}(z)),$$

$$k_{01}(z) = zk_0(z), \qquad k_{012}(z) = 0,$$

and $k(z) = 1/(1 - 3z + z^2)$. In this case $k_n = F_{2n+2}$ and $k_n(\{0, 2\}) = F_{2n-1} - 1$. ∎
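
The four sequences quoted for D = {−1, 0, 3}, b = 3 can be reproduced by brute force. This sketch is not part of the original answer. It assumes an "n-digit number" is any sum $\sum a_ib^i$ of n digits from D, and that $k_n(S)$ counts the translates of the pattern S lying inside $K_n$. The function names are illustrative.

```python
from itertools import product

def representable(D, b, n):
    """The set K_n of integers sum(a_i * b**i) with n digits a_i drawn from D."""
    return {sum(a * b**i for i, a in enumerate(digits))
            for digits in product(D, repeat=n)}

def count_translates(K, S):
    """Number of subsets of K that are translates S + x of the pattern S."""
    S = sorted(S)
    return sum(all(x + s - S[0] in K for s in S) for x in K)

D, b = (-1, 0, 3), 3
for pattern in ({0}, {0, 1}, {0, 2}, {0, 1, 2}):
    print(sorted(pattern),
          [count_translates(representable(D, b, n), pattern) for n in range(4)])
```

With n = 0 the product has a single empty tuple, so $K_0 = \{0\}$, matching the remark that zero is the only 0-digit number.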


SECTION 4.2.1 

1. N = (62, +.60 22 52 00); h = (37, +.10 54 50 00). Note that 10h would be
(38, +.01 05 45 00).

2. $b^{E-q}(1 - b^{-p})$, $b^{-q-p}$; $b^{E-q}(1 - b^{-p})$, $b^{-q-1}$.

3. When e does not have its smallest value, the most significant “one” bit (which 
appears in all such normalized numbers) need not appear in the computer word. 

4. (51, +.10209877); (50, +.12346000); (53, +.99999999). The third answer would be 
(54, + .10000000) if the first operand had been (45, —.50000000). 

5. If x ~ y and m is an integer then mb + x ~ mb + y. Furthermore x ~ y implies
x/b ~ y/b, by considering all possible cases. Another crucial property is that x and y
will round to the same integer, whenever x ~ y.

Now if $b^{-p-2}F_v \ne f_v$ we must have $(b^{p+2}f_v) \bmod b \ne 0$; hence the transformation
leaves $f_v$ unchanged unless $e_u - e_v \ge 2$. Since u was normalized, it is nonzero and
$|f_u + f_v| > b^{-1} - b^{-2} \ge b^{-2}$: the leading nonzero digit of $f_u + f_v$ must be at most
two places to the right of the radix point, and the rounding operation will convert
$b^{p+j}(f_u + f_v)$ to an integer, where $j \le 1$. The proof will be complete if we can show
that $b^{p+j+1}(f_u + f_v) \sim b^{p+j+1}(f_u + b^{-p-2}F_v)$. By the previous paragraph, we have
$b^{p+2}(f_u + f_v) \sim b^{p+2}f_u + F_v = b^{p+2}(f_u + b^{-p-2}F_v)$, which implies the desired result
for all $j \le 1$.

Note that, when b > 2 is even, such an integer $F_v$ always exists; but when b = 2
we require p + 3 bits (let $2F_v$ be an integer). When b is odd, an integer $F_v$ always exists
except in the case of division, when a remainder of $\frac12b$ is possible.

6. (Consider the case $e_u = e_v$, $f_u = -f_v$ in Program A.) Register A retains its
previous sign, as in ADD. 

7. Say that a number is normalized iff it is zero or its fraction part lies in the range
$\frac16 < |f| < \frac12$. A $(p+1)$-place accumulator suffices for addition and subtraction;
rounding (except during division) is equivalent to truncation. A very pleasant system 
indeed! We might represent numbers with excess-zero exponent, inserted between the 
first and subsequent digits of the fraction, and complemented if the fraction is negative, 
so that fixed-point order is preserved. 

8. (a) (06, +.12345679) $\oplus$ (06, −.12345678), (01, +.10345678) $\oplus$ (00, −.94000000);
(b) (99, +.87654321) $\oplus$ itself, (99, +.99999999) $\oplus$ (91, +.50000000).

9. a = c = (−50, +.10000000), b = (−41, +.20000000), d = (−41, +.80000000),
y = (11, +.10000000).

10. (50, +.99999000) $\oplus$ (55, +.99999000).

11. (50, +.10000001) $\otimes$ (50, +.99999990).

12. If $0 < |f_u| < |f_v|$, then $|f_u| \le |f_v| - b^{-p}$; hence $1/b < |f_u/f_v| \le 1 - b^{-p}/|f_v| <
1 - b^{-p}$. If $0 < |f_v| \le |f_u|$, we have $1/b \le |f_u/f_v|/b \le ((1 - b^{-p})/(1/b))/b = 1 - b^{-p}$.

13. See J. Michael Yohe, IEEE Transactions C-22 (1973), 577-586; cf. also exercise 
4.2.2-24. 


14. FIX  STJ  9F            Float-to-fix subroutine:
         STA  TEMP
         LD1  TEMP(EXP)     rI1 ← e.
         SLA  1             rA ← ±ffff0.
         JAZ  9F            Is input zero?
         DEC1 1
         CMPA =0=(1:1)      If leading byte is zero,
         JE   *-4             shift left again.
         ENN1 -Q-4,1
         J1N  FIXOVFLO      Is magnitude too large?
         ENTX 0
         SRAX 0,1
         CMPX =1//2=
         JL   9F
         JG   *+2
         JAO  9F
         STA  *+1(0:0)      Round, if necessary.
         INCA 1             Add ±1 (overflow is impossible).
    9H   JMP  *             Exit from subroutine.

15. FP   STJ  EXITF           Fractional part subroutine:
         JOV  OFLO            Ensure overflow is off.
         STA  TEMP            TEMP ← u.
         ENTX 0
         SLA  1               rA ← f_u.
         LD2  TEMP(EXP)       rI2 ← e_u.
         DEC2 Q
         J2NP *+3
         SLA  0,2             Remove integer part of u.
         ENT2 0
         JANN 1F
         ENN2 0,2             Fraction is negative: find
         SRAX 0,2               its complement.
         ENT2 0
         JAZ  *+2
         INCA 1
         ADD  WM1             Add word size minus one.
    1H   INC2 Q               Prepare to normalize the answer.
         JMP  NORM            Normalize, round, and exit.
    8H   EQU  1(1:1)
    WM1  CON  8B-1,8B-1(1:4)  Word size minus one


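16. If $|c| \ge |d|$, then set $r \leftarrow d \oslash c$, $s \leftarrow c \oplus (r \otimes d)$; $x \leftarrow (a \oplus (b \otimes r)) \oslash s$; $y \leftarrow
(b \ominus (a \otimes r)) \oslash s$. Otherwise set $r \leftarrow c \oslash d$, $s \leftarrow d \oplus (r \otimes c)$; $x \leftarrow ((a \otimes r) \oplus b) \oslash s$, $y \leftarrow
((b \otimes r) \ominus a) \oslash s$. Then $x + iy$ is the desired approximation to $(a + bi)/(c + di)$. [CACM
5 (1963), 435. Other algorithms for complex arithmetic and function evaluation are
given by P. Wynn, BIT 2 (1962), 232-255; see also Paul Friedland, CACM 10 (1967),
665.]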
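
The two cases of answer 16 can be sketched in Python as follows. This is only an illustration under stated assumptions: ordinary binary floats stand in for the floating point operations of the text, the function name `complex_divide` is hypothetical, and a zero divisor is not handled.

```python
def complex_divide(a, b, c, d):
    """Approximate (a + bi)/(c + di), scaling by the larger of |c| and |d|.

    Assumes c and d are not both zero (that case is not handled here).
    """
    if abs(c) >= abs(d):
        r = d / c                 # |r| <= 1, so r * d cannot overflow
        s = c + r * d
        x = (a + b * r) / s
        y = (b - a * r) / s
    else:
        r = c / d
        s = d + r * c
        x = (a * r + b) / s
        y = (b * r - a) / s
    return x, y


if __name__ == "__main__":
    print(complex_divide(1.0, 2.0, 3.0, 4.0))
    # Inputs near the overflow threshold still work, because c*c + d*d is never formed.
    print(complex_divide(1e300, 1e300, 1e300, 1e300))
```

The point of the scaling is that the intermediate quantity $c^2 + d^2$ of the textbook formula never appears, so operands of large magnitude do not overflow prematurely.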

17. See Robert Morris, IEEE Transactions C-20 (1971), 1578-1579. Error analysis 
is more difficult with such systems, so interval arithmetic is correspondingly more 
desirable. 

18. For positive numbers: shift fraction left until $f_1 = 1$, then round, then if the
fraction is zero (rounding overflow) shift it right again. For negative numbers: shift
fraction left until $f_1 = 0$, then round, then if the fraction is zero (rounding underflow)
shift it right again.

19. (43 + (1 if $e_v < e_u$) − (1 if fraction overflow) − (10 if result zero) + (4 if magnitude is
rounded up) + (1 if first rounding digit is $b/2$) + (5 if rounding digits are $b/2\,0\ldots0$) + (7 if
rounding overflow) + 7N + A(−1 + (11 if N > 0)))u, where N is the number of left
shifts during normalization and A = 1 if rX receives nonzero digits (otherwise A = 0).
The maximum time of 73u occurs for example when

u = +50 01 00 00 00, v = −46 49 99 99 99, b = 100.

[The average time, considering the data in Section 4.2.4, will be about 45 Jit.] 


SECTION 4.2.2 

1. $u \ominus v = u \oplus -v = -v \oplus u = -(v \oplus -u) = -(v \ominus u)$.

2. $u \oplus x \ge u \oplus 0 = u$, by (8), (2), (6); hence by (8) again, $(u \oplus x) \oplus v \ge u \oplus v$.
Similarly, (8) and (6) together with (2) imply that $(u \oplus x) \oplus (v \oplus y) \ge (u \oplus x) \oplus v$.

3. $u = 8.0000001$, $v = 1.2500008$, $w = 8.0000008$; $(u \otimes v) \otimes w = 80.000064$,
$u \otimes (v \otimes w) = 80.000057$.
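
The example in answer 3 can be checked mechanically. The sketch below assumes that Python's `decimal` module with an eight-digit context is an adequate stand-in for the eight-digit floating decimal arithmetic of the text; the variable names are illustrative only.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# Eight significant decimal digits, rounding to nearest.
ctx = Context(prec=8, rounding=ROUND_HALF_EVEN)

u = Decimal("8.0000001")
v = Decimal("1.2500008")
w = Decimal("8.0000008")

left = ctx.multiply(ctx.multiply(u, v), w)    # (u (x) v) (x) w
right = ctx.multiply(u, ctx.multiply(v, w))   # u (x) (v (x) w)

print(left, right)  # the two results differ in the last two digits
```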

4. Yes; let $1/u \approx v = w$, where $v$ is large.

5. Not always; in decimal arithmetic take u = v = 9. 

6. (a) Yes. (b) Only for $b + p \le 4$ (try $u = 1 - b^{-p}$). [W. M. Kahan observes that the
identity does hold whenever $b^{-1} \le f_u \le b^{-1/2}$. It follows that $1 \oslash (1 \oslash (1 \oslash u)) =
1 \oslash u$ for all $u$.]

7. If u and v are consecutive floating binary numbers, u 0 v — 2u or 2v. When it is 

2v we often have < 2v®. For example, u — (.10... 001 ) 2 , v = (.10... 010 ) 2 , 

u ® v — 2v, and u® -(- v® = (. 10 ... 011 ) 2 - 

8 . (a) «; (b) (c) ~, (d) ~; (e) ~. 

9. $|u - w| \le |u - v| + |v - w| \le \epsilon_1 \min(b^{e_u-q}, b^{e_v-q}) + \epsilon_2 \min(b^{e_v-q}, b^{e_w-q}) \le
\epsilon_1 b^{e_u-q} + \epsilon_2 b^{e_w-q} \le (\epsilon_1 + \epsilon_2) \max(b^{e_u-q}, b^{e_w-q})$. The result cannot be strengthened
in general, since for example we might have $e_u$ very small compared to both $e_v$ and $e_w$,
and this means that $u - w$ might be fairly large under the hypotheses.

10. We have $(.a_1 \ldots a_{p-1}a_p)_b \otimes (.9 \ldots 99)_b = (.a_1 \ldots a_{p-1}(a_p - 1))_b$ if $a_p \ge 1$; here
“9” stands for $b - 1$. Furthermore, $(.a_1 \ldots a_{p-1}a_p)_b \otimes (1.0 \ldots 0)_b = (.a_1 \ldots a_{p-1}0)_b$,
so the multiplication is not monotone if $b > 2$ and $a_p \ge 2$. But when $b = 2$, this
argument can be extended to show that multiplication is monotone; obviously the
“certain computer” had $b > 2$.

11. Without loss of generality, let $x$ be an integer, $|x| < b^p$. If $e \le 0$, then $t = 0$.
If $0 < e \le p$, then $x - t$ has at most $p + 1$ digits, the least significant being zero. If
$e > p$, then $x - t = 0$. [The result holds also under the weaker hypothesis $|t| < 2b^e$.]

12 . Assume that e u = p, e v < 0, u > 0. Case 1 , u > Case (la), w = u + 1, 

v > e v = 0. Then u' = u or u~\- 1 , v' = 1 , u" = u, v" = 1 or 0. Case (lb), w == u, 
|u| < 3 . Then u' = u, v' = 0, u" = u, v" = 0. If \v\ = 3 and more general rounding 
is permitted we might also have v! = rtil, v" = ^ 1 . Case (lc), w = u — 1 , v < — 3, 
e v = 0. Then u' = u or u — 1 , v' — — 1 , u" = u, v" = —1 or 0. Case 2, u = b p ~ 1 . 
Case (2a), w = u 1, v > 3, e v — 0. Like (la). Case (2b), w = u, |u| < 3, v! > u. 
Like (lb). Case (2c), w = u, |uj < 3, u' < u. Then u' — u — j/b where v = j/b-\-v 1 
and |ui| < \b ~ 1 for some positive integer j < 3 b; we have u' — 0 , u" — u, u" = j/ 6 . 
Case (2d), w < u. Then w = u — j/b where v = — j/b + v\ and |ui| < \b ~ 1 for 
some positive integer j < 6 ; we have {v',u") = (— j/b,u), and (u^u") = (u,— j/b ) or 
(u —1/6, (1— j)/b), the latter case only when v\ = 36 ”” 1 . In all cases u0u' = m — u', 
vQv' = v — v f , uQu" — u — u", D 0 / = v — v", round(w — u — v) — w — u — v. 

13. Since round(x) = 0 iff x = 0, we want to find a large set of integer pairs $(m, n)$
with the property that $m \oslash n$ is an integer iff $m/n$ is. Assume that $|m|, |n| < b^p$. If
$m/n$ is an integer, then $m \oslash n = m/n$ is also. Conversely if $m/n$ is not an integer,
but $m \oslash n$ is, we have $1/|n| \le |m \oslash n - m/n| < \frac12 |m/n|\, b^{1-p}$, hence $|m| > 2b^{p-1}$.
Our answer is therefore to require $|m| \le 2b^{p-1}$ and $0 < |n| < b^p$. (Slightly weaker
hypotheses are also possible.)

14. \(u®v)®w — uvw\ < |(u(g)^)(g)u; — (u(g)u)iy| +|w| — uv\ < 6^®^^ + 

6 w q w 8u(g)v ^ (1 ~h 6 )( 5 ( U 0 U )<gi U >. Now \e^u<^>v)^w s U (gi( V 0 U i)| ^ 2, so we may take 
£ = i(l + 6 ) 6 2 - p . 

15. $u \le v$ implies that $(u \oplus u) \oslash 2 \le (u \oplus v) \oslash 2 \le (v \oplus v) \oslash 2$, so the condition
holds for all $u$ and $v$ iff it holds whenever $u = v$. For base $b = 2$, the condition is
therefore always satisfied (barring overflow); but for $b > 2$ there are numbers $v \ne w$
such that $v \oplus v = w \oplus w$, hence the condition fails. [On the other hand, the formula
$u \oplus ((v \ominus u) \oslash 2)$ does give a midpoint in the correct range. Proof: It suffices to show
that $u \oplus (v \ominus u) \oslash 2 \le v$, i.e., $(v \ominus u) \oslash 2 \le v - u$; and it is easy to verify that
$\mathrm{round}(\frac12\,\mathrm{round}(x)) \le x$ for all $x \ge 0$.]
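
A small decimal experiment illustrates the failure and the repair described in answer 15. It is a sketch only: a two-digit `decimal` context is assumed as a model of floating decimal arithmetic with $b = 10$, $p = 2$, and the particular operands are chosen for illustration.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

ctx = Context(prec=2, rounding=ROUND_HALF_EVEN)   # b = 10, p = 2
u, v = Decimal("0.96"), Decimal("0.97")
two = Decimal(2)

# (u (+) v) (/) 2: the rounded sum 1.93 -> 1.9 loses a digit before halving.
naive = ctx.divide(ctx.add(u, v), two)

# u (+) ((v (-) u) (/) 2): the alternative formula from the bracketed remark.
safe = ctx.add(u, ctx.divide(ctx.subtract(v, u), two))

print(naive, safe)   # naive falls below u; safe stays inside [u, v]
```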

16. (a) Exponent changes occur at $\Sigma_{10} = 11.111111$, $\Sigma_{91} = 101.11111$, $\Sigma_{901} =
1001.1102$, $\Sigma_{9001} = 10001.020$, $\Sigma_{90009} = 100000.91$, $\Sigma_{900819} = 1000000.0$; therefore
$\Sigma_{1000000} = 1109099.1$.

(b) ^Zi<fc<n 1-2345679 = 1224782.1, and (14) tries to take the square root of 

—.0053187053. But (15) and (16) are exact in this case. [If Xk = 1 + [(k — 1)/2J 10 7 , 
(15) and (16) have errors of order n. See Chan and Lewis, CACM 22 (1979), 526-531, 
for further results on the accuracy of standard deviation calculations.] 

(c) We need to show that $u \oplus ((v \ominus u) \oslash k)$ lies between $u$ and $v$; see exercise 15.
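
The final value in part (a) can be reproduced by brute force. The sketch below assumes that the quantity in question is the running sum of one million copies of 1.1111111 in eight-digit floating decimal arithmetic, modeled here with Python's `decimal` module; the loop takes a moment to run.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

ctx = Context(prec=8, rounding=ROUND_HALF_EVEN)
x = Decimal("1.1111111")

total = Decimal(0)
for _ in range(10**6):
    # Each partial sum is rounded to eight significant digits.
    total = ctx.add(total, x)

exact = x * 10**6        # exact product for comparison (fits in 8 digits)
print(total, exact)
```

Once the partial sum passes $10^6$ only one decimal place survives, so every later step contributes 1.1 instead of 1.1111111; that is where most of the discrepancy accumulates.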


17. FCMP STJ  9F             Floating point comparison subroutine:
         JOV  OFLO           Ensure overflow is off.
         STA  TEMP
         LDAN TEMP           v ← −v.
         (Copy here lines 07-20 of Program 4.2.1A.)
         LDX  FV(0:0)        Set rX to zero with sign of f_v.
         DEC1 5
         J1N  *+2
         ENT1 0              Replace large difference in exponents
         SRAX 5,1              by a smaller one.
         ADD  FU             rA ← difference of operands.
         JOV  7F             Fraction overflow: not ~.
         CMPA EPSILON(1:5)
         JG   8F             Jump if not ~.
         JL   6F             Jump if ~.
         JXZ  9F             Jump if ~.
         JXP  1F             If |rA| = ε, check sign of rA × rX.
         JAP  9F             Jump if ~. (rA ≠ 0)
         JMP  8F
    7H   ENTX 1
         SRC  1              Make rA nonzero with same sign.
         JMP  8F
    1H   JAP  8F             Jump if not ~. (rA ≠ 0)
    6H   ENTA 0
    8H   CMPA =0=            Set comparison indicator.
    9H   JMP  *              Exit from subroutine.


19. Let 7^ = 6k — 7}k = <?k — 0 for k > n. It suffices to find the coefficient of x\, 
since the coefficient of Xk will be just the same except with all subscripts increased by 
k — 1. Let ( fk, gk ) denote the coefficient of x\ in (Sk — Ck,Ck ) respectively. Then f\ = 
(l + ? 7 i)(l — 61O1— 7i0iCTi), gi = (1-f 5 i)(l-f 771X71+0-1+71(7!), 

and f k = (1 — 7/cCTfc — 6 k (Jk — ykf>kOk)fk-\ + (7* — Vk + Ikfa + ikVk + 7 kdkijk + 

7 kV^k 4- + ik6kVk(Jk)gk-i, gk = o k { 1 + 7 * 0(1 4 - &k)fk -1 — (14 4 )( 7 k 4 
>?? fc -f rjkOk + 7 fct?fc<T fc )gfc- 1 , for 1 < k < n. Thus / n = 1 © r?i — 71 + (4 n terms 
of 2 nd order) -f- (higher order terms) = 1 —|— 771 — '■yi ©. 0 (ne 2 ) is sufficiently small. 
[The Kahan summation formula was first published in CACM 8 (1965), 40; cf. Proc. 
IFIP Congress (1971), 2, 1232. For another approach to accurate summation, see R. J. 
Hanson, CACM 18 (1975), 57-58. See also G. Bohlender, IEEE Trans. C-26 (1977), 

621-632, for algorithms that compute round©©-t-x n ) and round(xi... x n ) exactly, 

given { 21 ,..., x n }-] 

20 . By the proof of Theorem C, (47) fails for e w = p only if |u| © \ > \w — u\ > 
b p ~~ 1 ©fr~ x ; hence \ f u \ > |/ w | > 1 — (^b — 1 )b~ p . This rather rare case, in which \f w \ 
before normalization takes its maximum value 2, is necessary and sufficient for failure. 

21. (Solution by G. W. Veltkamp.) Let $c = 2^{\lceil p/2\rceil} + 1$; we may assume that $p \ge 2$,
so $c$ is representable. First compute $u' = u \otimes c$, $u_1 = (u \ominus u') \oplus u'$, $u_2 = u \ominus u_1$;
similarly, $v' = v \otimes c$, $v_1 = (v \ominus v') \oplus v'$, $v_2 = v \ominus v_1$. Then set $w \leftarrow u \otimes v$,
$w' \leftarrow (((u_1 \otimes v_1 \ominus w) \oplus (u_1 \otimes v_2)) \oplus (u_2 \otimes v_1)) \oplus (u_2 \otimes v_2)$.

It suffices to prove this when u, v > 0 and e u = e u = p, so that u and v are 
integers £ [2 P—1 ,2 P ). Then u = u\ © u 2 where 2 P—1 < Ui < 2 P , Ui mod2^ p/2 ^ = 0, 
and |u 2 | < 2 ^ p/2 "' -1 ; similarly v — vi©u 2 . The operations during the calculation of w 1 
are exact, because w — UiV\ is a multiple of 2 P ~ 1 such that — U 1 V 11 < |w — uv\-\- 
\u 2 V 1 + U\v 2 © u 2 v 2 \ < 2 P—1 © 2 p+l ' p/2 ' 1 © 2 P__1 ; and similarly |w — U\V\ — U\V 2 \ < 
\w — uv j © |u 2 v| < 2 P—1 © 2^ p ' /2 ^ — 1 + p , where 161 ——UiV 2 is a multiple of 2^ p/2 \ 
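
The splitting scheme of answer 21 carries over directly to IEEE double precision, where $p = 53$ and $c = 2^{27} + 1$. The following is a minimal sketch under those assumptions (binary64 with round-to-nearest, no overflow or underflow); the function names are illustrative, and exact rational arithmetic is used only to check the result.

```python
from fractions import Fraction


def split(u, p=53):
    """Split u into u1 + u2, each half short enough that partial products are exact."""
    c = float(2 ** ((p + 1) // 2) + 1)   # c = 2^ceil(p/2) + 1
    t = u * c                            # u' = u (x) c
    u1 = (u - t) + t                     # u1 = (u (-) u') (+) u'
    u2 = u - u1                          # u2 = u (-) u1
    return u1, u2


def exact_product(u, v):
    """Return (w, w_err) with w = fl(u*v) and w + w_err equal to u*v exactly."""
    u1, u2 = split(u)
    v1, v2 = split(v)
    w = u * v
    w_err = (((u1 * v1 - w) + u1 * v2) + u2 * v1) + u2 * v2
    return w, w_err


if __name__ == "__main__":
    w, w_err = exact_product(0.1, 0.3)
    print(w, w_err)
    print(Fraction(w) + Fraction(w_err) == Fraction(0.1) * Fraction(0.3))
```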

22 . We may assume that b p ~~ x < u, v < 6 P . If uv < b 2p ~ 1 , then Xi = uv — r where 
l r l < 2 b p 1 > hence x 2 = round©— r/u) = xo (since |r/u| < %b p ~ 1 /b p ~ 1 < © and 
equality implies v = b p ~ 1 hence r = 0). If uv > & 2p ~ 1 , then x\ — uv — r where 
)r| < \b p , hence X\jv < u — rfv < 6 P + 5 b and x 2 < b p . If x 2 = b p , then X 3 = X\ 
(for otherwise ( b p — \)v < xi < b p (v — 1)). If x 2 < b p and xi > b 2p ~~ l , then let 
x 2 = xi jv q where |q| < © we have 23 = round(xi + qv) = X\. Finally if x 2 < b p 
and Xi = b 2p—1 and 23 < b 2p ~ x , then 24 = x 2 by the first case above. This situation 
arises, for example, when b = 10, p = 2, u = 19, v = 55, 21 = 1000, x 2 — 18, 
23 = 990. 

23. Let [u\ = n, so that u (mod) 1 = u Q n = u — n-\- r where |r| < \b~ p ; we wish 
to show that round(n — r) = n. The result is clear if [n| > 1; and r = 0 when n = 0 
or 1 , so the only subtle case is when n = —1, r = — \b~~ p . The identity fails iff b is 
a multiple of 4 and — b ~~ 1 < u < — b~ 2 and umod26 —p = p (e.g., p = 3, b — 8 , 
u = — (.0124) 8 ). 

24. Let u — [iq, u r ], v = [vi, v r ]. Then u ® v — [ui ^ Vi, u r A v r ], where 2 A y = 
y Ax, x A -[-0 = 2 for all x, x A —0 = 2 for all 2 7 ^ -f 0 , 2 A -f 00 = +00 for all 
2 7 ^ — 00 , and 2 A —00 needn’t be defined; 2 y y — —^((— 2 ) A(—y)). If 2 © y would 
overflow in normal floating point arithmetic because 2 © y is too large, then 2 A y is 
©00 and 2 V y is the largest representable number. 

For subtraction, let u © v = u © (—u), where —u = [— v r , — Vi]. 

Multiplication is somewhat more complicated. The correct procedure is to let 

u © v = [ min(uz ^ v h ui^ v r ,u r ^ vi,u r ^ v r ), max(uz A vi, m A v r , u r A vi, u r A v r )], 

where 2 Ay = y Ax, 2 A(—y) = —( 2 ^y) = (— 2 ) Ay; 2 A ©0 = (©0 for 2 > 0 , 
—0 for 2 < 0 ); 2 A— 0 — — (xA© 0 ); 2 A ©00 = (+00 for 2 > © 0 , —00 for 
2 < —0). (It is possible to determine the min and max simply by looking at the signs 
of Ui, u r , vi, and v r , thereby computing only two of the eight products, except when 
ui < 0 < u r and vi < 0 < v r ; in the latter case we compute four products, and the 
answer is [min(iq V v r , u r V Vi), max(ui A v\, u r A u r )j.) 

Finally, u 0 v is undefined if vi < 0 < v r ) otherwise we use the formulas for 
multiplication with vi and v r replaced respectively by v~ l and v~ l , where x A y~ x = 
x Ay, x vr 1 = x s/y, (±0) _1 = ±00, (±oo ) -1 = ±0. 

[Cf. E. R. Hansen, Math. Comp. 22 (1968), 374-384. An alternative scheme, in 
which division by 0 gives no error messages and intervals may be neighborhoods of 00, 
has been proposed by W. M. Kahan. In Kahan’s scheme, for example, the reciprocal of 
[— 1 , + 1 ] is [+ 1 , — 1 ], and an attempt to multiply an interval containing 0 by an interval 
containing 00 yields [— 00, +00], the set of all numbers. See Numerical Analysis , Univ. 
Michigan Engineering Summer Conf. Notes No. 6818 (1968).] 

25. Cancellation reveals previous errors in the computation of $u$ and $v$. For example,
if $\epsilon$ is small, we often get poor accuracy when computing $f(x + \epsilon) \ominus f(x)$, because
the rounded calculation of $f(x + \epsilon)$ destroys much of the information about $\epsilon$. It is
desirable to rewrite such formulas as $\epsilon \otimes g(x, \epsilon)$, where $g(x, \epsilon) = (f(x + \epsilon) - f(x))/\epsilon$ is
first computed symbolically. Thus, if $f(x) = x^2$ then $g(x, \epsilon) = 2x + \epsilon$; if $f(x) = \sqrt{x}$
then $g(x, \epsilon) = 1/(\sqrt{x + \epsilon} + \sqrt{x})$.
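
The square-root case can be tried numerically. This sketch assumes IEEE double precision for the floating point operations and uses a 50-digit `decimal` computation as the reference value; the function names and the choice $x = 1$, $\epsilon = 10^{-12}$ are illustrative only.

```python
import math
from decimal import Context, Decimal


def naive_difference(x, eps):
    """f(x + eps) (-) f(x) computed directly, for f = sqrt."""
    return math.sqrt(x + eps) - math.sqrt(x)


def rewritten_difference(x, eps):
    """eps (x) g(x, eps) with g(x, eps) = 1/(sqrt(x + eps) + sqrt(x))."""
    return eps * (1.0 / (math.sqrt(x + eps) + math.sqrt(x)))


def relative_error(approx, x, eps):
    """Relative error against a 50-digit reference value."""
    ctx = Context(prec=50)
    dx, de = Decimal(x), Decimal(eps)
    ref = ctx.subtract(ctx.sqrt(ctx.add(dx, de)), ctx.sqrt(dx))
    return abs((Decimal(approx) - ref) / ref)


if __name__ == "__main__":
    x, eps = 1.0, 1e-12
    print(relative_error(naive_difference(x, eps), x, eps))
    print(relative_error(rewritten_difference(x, eps), x, eps))
```

The direct form loses roughly the digits of $\epsilon$ that do not fit beside $x$ in the sum $x + \epsilon$, while the rewritten form keeps nearly full precision.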

26. See Math. Comp. 32 (1978), 227-232. 


SECTION 4.2.3 


1. First, $(w_m, w_l) = (.573, .248)$; then $w_m v_l / v_m = .290$; so the answer is (.572, .958).
This in fact is the correct result to six decimals.

2. The answer is not affected, since the normalization routine truncates to eight places 
and can never look at this particular byte position. (Scaling to the left occurs at most 
once during normalization, since the inputs are normalized.) 

3. Overflow obviously cannot occur at line 09, since we are adding two-byte quantities, 
or at line 22 , since we are adding four-byte quantities. In line 30 we are computing the 
sum of three four-byte quantities, so this cannot overflow. Finally, in line 32, overflow 
is impossible because the product f u f v must be less than unity. 

4. Insert “JOV 0FL0; ENT1 0” between lines 03 and 04. Replace lines 21-22 by 
“ADD TEMP(ABS) ; JN0V *+2; INC1 1”, and change lines 28-31 to “SLAX 5; ADD TEMP; 
JN0V *+2; INC1 1; ENTX 0,1; SRC 5”. This adds five lines of code and only 1, 2, or 3 
units of execution time. 


5. Insert “JOV 0FL0” after line 06. Change lines 22, 31, 39 respectively to “SRAX 0,1”, 
“SLAX 5”, “ADD ACC”. Between lines 40 and 41, insert “DEC2 1; JN0V DN0RM; INC2 1; 
INCX 1; SRC 1”. (It’s tempting to remove the “DEC2 1” in favor of “STZ EXPO”, but 
then “INC2 1” might overflow rI2!) This adds six lines of code; the running time 
decreases by 3u, unless there is fraction overflow, when it increases by 7u. 


6.  DOUBLE STJ  EXITDF       Convert to double precision:
           ENTX 0            Clear rX.
           STA  TEMP
           LD2  TEMP(EXP)    rI2 ← e.
           INC2 QQ-Q         Correct for difference in excess.
           STZ  EXPO         EXPO ← 0.
           SLAX 1            Remove exponent.
           JMP  DNORM        Normalize and exit.
    SINGLE STJ  EXITF        Convert to single precision:
           JOV  OFLO         Ensure overflow is off.
           STA  TEMP
           LD2  TEMP(EXPD)   rI2 ← e.
           DEC2 QQ-Q         Correct for difference in excess.
           SLAX 2            Remove exponent.
           JMP  NORM         Normalize, round, and exit.


7. All three routines give zero as the answer if and only if the exact result would be 
zero, so we need not worry about zero denominators in the expressions for relative error. 
The worst case of the addition routine is pretty bad: Visualized in decimal notation, 
if the inputs are 1.0000000 and .99999999, the answer is $b^{-7}$ instead of $b^{-8}$; thus the
maximum relative error $\delta_1$ is $b - 1$, where $b$ is the byte size.

For multiplication and division, we may assume that both operands are positive
and have the same exponent QQ. The maximum error in multiplication is readily
bounded by considering Fig. 4: When $uv \ge 1/b$, we have $0 \le uv - u \otimes v < 3b^{-9} +
(b - 1)b^{-9}$, so the relative error is bounded by $(b + 2)b^{-8}$. When $1/b^2 \le uv < 1/b$,
we have $0 \le uv - u \otimes v < 3b^{-9}$, so the relative error in this case is bounded by
$3b^{-9}/uv \le 3b^{-7}$. We take $\delta_2$ to be the larger of the two estimates, namely $3b^{-7}$.

Division requires a more careful analysis of Program D. The quantity actually 
computed by the subroutine is a — 6 — 6 e((a — 8 "){P — 6 ') — 8 '") — 8 n where a = 
(u m -(- eui)/bv m , p = vi/bvm, and the nonnegative truncation errors ( 8 , 6 ', 8 ”, 6 '") are 
respectively less than ( 6 —10 , b ~ 5 , 6 _ “ 5 , 6 ~~ 6 ); finally <5 n (the truncation during normaliza¬ 
tion) is nonnegative and less than either b ~~ 9 or b ~~ 8 , depending on whether scaling 
occurs or not. The actual value of the quotient is a/(l —(— 6e/3) = a — beaP -f- b 2 a.p 2 8 "", 
where 8 "" is the nonnegative error due to truncation of the infinite series ( 2 ); here 
8 "" < e 2 = 6 ~ 10 , since it is an alternating series. The relative error is therefore the 
absolute value of (be 8 ' -|- be 8 "P/a -|- beS'"/a) — [8 / a + bc 8 ' 8 "/a + b 2 p 2 8 ”" 8 n /a), 

times (l-\-bep). The positive terms in this expression are bounded by 6 9 —)-6 8 -|— 6 8 , 
and the negative terms are bounded by b ~ 8 + b ~ 12 -\- b ~ 8 plus the contribution by the 
normalizing phase, which can be about b ~ 7 in magnitude. It is therefore clear that the 
potentially greatest part of the relative error comes during the normalization phase, 
and that 63 = (6 -(- 2 ) 6 —8 is a safe upper bound for the relative error. 

8. Addition: If $e_u \le e_v + 1$, the entire relative error occurs during the normalization
phase, so it is bounded above by $b^{-7}$. If $e_u \ge e_v + 2$, and if the signs are the same,
again the entire error may be ascribed to normalization; if the signs are opposite, the
error due to shifting digits out of the register is in the opposite direction from the
subsequent error introduced during normalization. Both of these errors are bounded
by $b^{-7}$, hence $\delta_1 = b^{-7}$. (This is substantially better than the result in exercise 7.)

Multiplication: An analysis as in exercise 7 gives $\delta_2 = (b + 2)b^{-8}$.


SECTION 4.2.4 

1. Since fraction overflow can occur only when the operands have the same sign,
this is the probability that fraction overflow occurs divided by the probability that the
operands have the same sign, namely, $7\%/(\frac12(91\%)) \approx 15\%$.

3. $\log_{10} 2.4 - \log_{10} 2.3 \approx 1.84834\%$.
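
The figure in answer 3 is a one-line computation under the logarithmic law. The helper below is a hypothetical illustration; it assumes that the probability of a fraction starting with digits between `lo` and `hi` is $\log_{10} hi - \log_{10} lo$.

```python
import math


def leading_digits_probability(lo, hi):
    """Probability, under the logarithmic law, that 10*f lies in [lo, hi)."""
    return math.log10(hi) - math.log10(lo)


p = leading_digits_probability(2.3, 2.4)
print(f"{100 * p:.5f}%")
```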

4 . The pages would be uniformly gray (same as “random point on a slide rule”). 

5. The probability that $10f_u \le r$ is $(r - 1)/10 + (r - 1)/100 + \cdots = (r - 1)/9$.
So in this case the leading digits are uniformly distributed; e.g., leading digit 1 occurs
with probability $\frac19$.

6. The probability that there are three leading zero bits is $\log_{16} 2 = \frac14$; the probability
that there are two leading zero bits is $\log_{16} 4 - \log_{16} 2 = \frac14$; and similarly for the other
two cases. The “average” number of leading zero bits is $1\frac12$, so the “average” number of
“significant bits” is $p + \frac12$. The worst case, $p - 1$ bits, occurs however with rather high
probability. In practice, it is usually necessary to base error estimates on the worst
case, since a chain of calculations is only as strong as its weakest link. In the error
analysis of Section 4.2.2, the upper bound on relative rounding error for floating hex
is $2^{1-p}$. In the binary case we can have $p + 1$ significant bits at all times (cf. exercise
4.2.1-3), with relative rounding errors bounded by $2^{-1-p}$. Extensive computational
experience confirms that floating binary (even with $p$-bit precision instead of $p + 1$)
produces significantly more accurate results than $(p + 2)$-bit floating hex.

Tables 1 and 2 show that hexadecimal arithmetic can be done a little faster, since 
fewer cycles are needed when scaling to the right or normalizing to the left. But this 
fact is insignificant compared to the substantial advantages of b = 2 over other radices 
(cf. also Theorem 4.2.2C and exercises 4.2.2-13, 15, 21), especially since floating binary 
can be made as fast as floating hex with only a tiny increase in total processor cost. 

7. For example, suppose that ^ m (F(10 fcm ' 5 fc ) — F(10 fcm )) = Iog5 fc /logl0 fc and 

also that 1 4 fc ) — F( 10 fc ™)) = log4*/ log 10 fc ; then 

]T (F(10 km ■ 5 k ) - F(10 fcm ■ 4*)) = log 10 - 

m 

for all k. But now let e be a small positive number, and choose 6 > 0 so that F(x) < e 
for 0 < x < 6 , and choose M > 0 so that F(x) >1 — e for x > M. We can take k 
so large that 10 —fc • 5 fc < 6 and 4 fc > M; hence by the monotonicity of F, 

Y] (F(10 km ■ 5 fc ) — F(10 fcm • 4 fc )) < Y ’ 5 fc ) — F(10 fc(m-1) • 5 fc )) 

m m < 0 

+ Y (^(10 fc(m+1) • 4 fc ) - F(10 fcm • 4 fc )) 

m> 0 

= F(10 -fc 5 fc ) + 1 — F(10 fc 4 fc ) < 2e. 


8. When $s > r$, $P_0(10^n s)$ is 1 for small $n$, and 0 when $\lfloor 10^n s\rfloor > \lfloor 10^n r\rfloor$. The least $n$
for which this happens may be arbitrarily large, so no uniform bound can be given
for $N_0(\epsilon)$ independent of $s$. (In general, calculus textbooks prove that such a uniform
bound would imply that the limit function $S_0(s)$ would be continuous, and it isn’t.)

9. Let qi, qi, ... be such that Po(n) — + <72 (”T 1 ) H-^ or n - ^ follows 

that P m (n) — r _m qi( 7l " 0 " 1 ) + 2~ w q 2 ( n 7 1 ) H- for m and n • 

10. When $1 < r < 10$ the generating function $C(z)$ has simple poles at the points
$1 + w_n$, where $w_n = 2\pi ni/\ln 10$, hence


C(z) = 


log 10 r — 1 , V"' 1 H - w n e 


l— z Wn 


+ E{z) 


n^O 


(In 10)(z — 1 — Wn) 
where B(z) is analytic in the entire plane. Thus if 9 — arctan( 27 r/ In 10), 

2 


Cm — log 10 r — 1 — 


In 10 


£»( 

n>0 


p—Wn lnr _ 

W n (l W n ) m 


+ e 


m 


^ w r _i , sin(m^ + 2?r log 10 r) - sin(mg) 

Sl ° + 7 r(l -f~ 47 T 2 /(In 10) 2 ) m / 2 + ’ 


1 


.(1 + 167r 2 /(ln 10 ) 2 ) m / 2 . 


11. When $(\log_b U) \bmod 1$ is uniformly distributed in (0,1), so is $(\log_b 1/U) \bmod 1 =
(1 - \log_b U) \bmod 1$.

12. We have h(z) = f* f(x) dx g(z/bx)/bx 4- fl f{x) dx g{zfx)jx, and the same holds 
for l{z) = f* /b f(x) dx l(z/bx)/bx + f] f(x) dx l(z/x)/x > hence 

h(z)-l(z) = f g{z/bx) - l(z/bx) I" g (z/x) - l(z/x) 

K%) Ji/b J{ l(z/bx) ^ J z l{z/x) 

Since f(x) > 0, \(h{z) — l(z))/l{z)\ < f* /b f (x) dx A(g) -f- f* f(x)dxA(g) for all 2 , hence 
A(h) < A(g). By symmetry, A(h) < A(f). [Bell System Tech. J. 49 (1970), 1609-1625.] 

13. Let $X = (\log_b U) \bmod 1$ and $Y = (\log_b V) \bmod 1$, so that $X$ and $Y$ are
independently and uniformly distributed in [0,1). No left shift is needed if and only if
$X + Y \ge 1$, and this occurs with probability $\frac12$.

(Similarly, the probability is $\frac12$ that floating point division by Algorithm 4.2.1M
needs no normalization shifts; this needs only the weaker assumption that both of the
operands independently have the same distribution.)

14. For convenience, the calculations are shown here for b = 10. If /c = 0, the 
probability of a carry is 


0 l 






(The latter integral is essentially a “dilogarithm.”) Hence the probability of a carry
when $k = 0$ is $(1/\ln 10)^2(\pi^2/6 - 2\sum_{n\ge1} 1/n^2 10^n) = .27154$. [Note: When $b = 2$ and
$k = 0$, fraction overflow always occurs, so this derivation proves that $\sum_{n\ge1} 1/n^2 2^n =
\pi^2/12 - \frac12(\ln 2)^2$.]
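
Both numerical claims above can be checked with a short series evaluation. The sketch assumes that the closed form generalizes from 10 to an arbitrary radix $b$ in the obvious way, as the bracketed note about $b = 2$ suggests; the function names are illustrative.

```python
import math


def dilog_series(z, terms=200):
    """Partial sum of sum_{n>=1} z^n / n^2, adequate for 0 <= z <= 1/2."""
    return sum(z**n / n**2 for n in range(1, terms + 1))


def carry_probability_k0(b=10):
    """(1/ln b)^2 * (pi^2/6 - 2 * sum_{n>=1} 1/(n^2 b^n)), the k = 0 case."""
    return (math.pi**2 / 6 - 2 * dilog_series(1.0 / b)) / math.log(b) ** 2


if __name__ == "__main__":
    print(round(carry_probability_k0(10), 5))   # decimal case
    print(carry_probability_k0(2))              # binary case: overflow is certain
```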

When k > 0, the probability is 

( 1 y f 10l ~ k dy r dx = (J_\ 2 ( y- _J _ y- 1 

Vln 10/ J\q —fc y J io j/ £ Vln 10/ l * ^ ?i^l0 n ^ * ^ 2 ^Qn(fc+i) 

Thus when $b = 10$, fraction overflow should occur with probability $.272p_0 + .017p_1 +
.002p_2 + \cdots$. When $b = 2$ the corresponding figures are $p_0 + .655p_1 + .288p_2 + .137p_3 +
.067p_4 + \cdots$.

Now if we use the probabilities from Table 1 , dividing by .91 to eliminate zero 
operands and assuming that the probabilities are independent of the operand signs, we 
predict a probability of about 14 percent when b = 10, instead of the 15 percent in 
exercise 1. For b = 2, we predict about 48 percent, while the table yields 44 percent. 
These results are certainly in agreement within the limits of experimental error. 

15. When k = 0, the leading digit is 1 if and only if there is a carry. (It is possible for 
fraction overflow and subsequent rounding to yield a leading digit of 2 , when b > 4 , 
but we are ignoring rounding in this exercise.) The probability of fraction overflow is 
.272 < log 10 2, as shown in the previous exercise. 

When k > 0, the leading digit is 1 with probability 



f 1 \ 2 f r 1 'dyf dx 

Vln 10/ yj 10 -k yJ oT \f*<>-y l0 x 


)< (my r 1- ^ [ ^ 

J Vln 10/ \J 10 -k y J i < x < 2 X J 


logio 2. 


16. To prove the hint [which is due to Landau, Prace matematyczno-fizyczne 21 (1910), 

103-113], assume first that limsupa n = X > 0. Let e = X/(X + AM) and choose N 
so that |ai -[-••• -j- a n | < ygtXn for all n > N. Let n > N/( 1 — e), n > 5/e be such 
that a n > £X. Then, by induction, a n ~k > a n — kM/(n~en) > |X for 0 < k < en, 
and J2n-en<k<n ak > i X ( cn “ 1 ) > But 

|^->n— en<k<n® lc X^l<fc<n—en | 

since n — en > N. A similar contradiction applies if liminf a n < 0. 

Assuming that P m +i(n) -+ X as n -> oo, let a k = P m {k ) — X. If m > 0, the 
dk satisfy the hypotheses of the hint (cf. Eq. 4.2.2-15), since 0 < P m (k) < 1; hence 
Pm{'d) —* X. 

17. See Fibonacci Quarterly 7 (1969), 474-475. (Persi Diaconis [Ph.D. thesis, Harvard 
University, 1974] has shown among other things that the definition of probability by 
repeated averaging is weaker than harmonic probability, in the following precise sense: 
If lim TO --»oo lirriinf n ->oo L m (n) — lim m ->oo limsup n _ >00 F m (n) — X then the harmonic 
probability is X. On the other hand the statement “ 10 fc2 < n < 10 fc2+fc for some 
integer k > 0” has logarithmic probability while repeated averaging never settles 
down to give it any particular probability.) 

18. Let p(a) = P{L a ) and p(a, b ) = J2 a<k<b p{k) for 1 < a < b. Since L a — L 10 a U 

Lioa+i U • ■ ■ U Lioa +9 for all a, we have p(a) = p(lOa, 10(a + 1)) by (i). Furthermore 
since P(S) = P(2S) + P{2S -(- 1 ) by (i), (ii), (iii), we have p(a) = p( 2 a, 2(a -)- 1 )). It 
follows that p(a, b) — p( 2 m 10 n a, 2 m 10 n 6) for all m, n > 0 . 

If 1 < b/a < b'/a', then p(a, b) < p(a', b'). The reason is that there exist integers 
m, n, m', n' such that 2 rn 'lO n a' < 2 m 10 n a < 2 m 10 n 6 < 2 m/ 10 n V as a consequence 
of the fact that log 2 /log 10 is irrational, hence we can apply (v). (Cf. exercise 3.5-22 
with k = 1 and U n = n log 2 / log 10 .) In particular, p(a) > p(a + 1 ), and it follows 
that p(a, b)/p(a, 6 + 1 ) > (6 — a )/(6 + 1 — a). (Cf. Eq. 4.2.2-15.) 

Now we can prove that p(a,b ) = p(a',b') whenever b/a = b'/a'; for p(a,b ) = 
p( 10 n a, 10 n 6) < c n p( 10 n a, 10 n 6 — 1 ) < c n p(a',b'), for arbitrarily large values of n, 
where c n — 10 n (6 — a)/(l 0 n (6 — a) — l) = 1 + O( 10 —n ). 

For any positive integer n we have p(a n , b n ) = p(a n , ba n ~ 1 ) + p( 6 a n—1 , b 2 a n ~ 2 ) + 
• • • + p( 6 n - 1 a, b n ) = np(a, b). If 10 w < a n < 10 m+1 and 10 m ' < b n < 10 m ' +1 , then 
p(10 m+ 1 ,10 m ') < p(a n , b n ) < p{ 10 m , 10 w ' +1 ) by (v). But p(l, 10) = 1 by (iv), hence 

p( 10 m , 10 m/ ) = m! — m for all m' > m. We conclude that Ll°gio^ n J — U°Sio 0/71 \ — ^ ^ 
np{a , 6 ) < [log 10 6 n J + [log 10 a n J + 1 for all n, and p(a, b) = log 10 ( 6 /o). 

[This exercise was inspired by D. I. A. Cohen, who proved a slightly weaker result 
in J. Combinatorial Theory (A) 20 (1976), 367-370.] 


SECTION 4.3.1 

2. If the $i$th number to be added is $u_i = (u_{i1}u_{i2} \ldots u_{in})_b$, use Algorithm A with
step A2 changed to the following:

A2'. [Add digits.] Set

$w_j \leftarrow (u_{1j} + \cdots + u_{mj} + k) \bmod b$, and $k \leftarrow \lfloor (u_{1j} + \cdots + u_{mj} + k)/b \rfloor$.
(The maximum value of $k$ is $m - 1$, so step A3 would have to be altered if $m > b$.)


3.       ENT1 N          1    
         JOV  OFLO       1    Ensure overflow is off.
         ENTA 0          1    k ← 0.
    2H   ENT2 0          N    (rI2 ≡ next value of k)
         ENT3 M*N-N,1    N    (LOC(u_ij) ≡ U + n(i − 1) + j)
    3H   ADD  U,3        MN   rA ← rA + u_ij.
         JNOV *+2        MN
         INC2 1          K    Carry one.
         DEC3 N          MN   Repeat for m ≥ i > 0.
         J3P  3B         MN   (rI3 ≡ n(i − 1) + j)
         STA  W,1        N    w_j ← rA.
         ENTA 0,2        N    k ← rI2.
         DEC1 1          N
         J1P  2B         N    Repeat for n ≥ j > 0.
         STA  W          1    Store final carry in w_0.

Running time, assuming that $K = \frac12 MN$, is $5.5MN + 7N + 4$ cycles.
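
Step A2' of answer 2 can be sketched in Python as follows. This is an illustration only: digits are held in lists with the most significant digit first, the function name `add_columns` is hypothetical, and the final carry $w_0$ is allowed to be as large as $m - 1$.

```python
def add_columns(rows, b):
    """Add m numbers of n radix-b digits each, one column at a time.

    rows[i][j] is digit j (most significant first) of the i-th number.
    Returns [w_0, w_1, ..., w_n], where w_0 is the final carry.
    """
    n = len(rows[0])
    w = [0] * (n + 1)
    k = 0                                  # running carry, at most m - 1
    for j in range(n - 1, -1, -1):         # rightmost column first
        t = sum(row[j] for row in rows) + k
        w[j + 1] = t % b                   # w_j <- (u_1j + ... + u_mj + k) mod b
        k = t // b                         # k <- floor((u_1j + ... + u_mj + k) / b)
    w[0] = k
    return w


if __name__ == "__main__":
    print(add_columns([[9, 9, 9], [9, 9, 9], [9, 9, 9]], 10))  # 999 + 999 + 999
```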

4. We may make the following assertion before A1: “$n \ge 1$; and $0 \le u_i, v_i < b$ for
$1 \le i \le n$.” Before A2, we assert: “$1 \le j \le n$; $0 \le u_i, v_i < b$ for $1 \le i \le n$;
$0 \le w_i < b$ for $j < i \le n$; $0 \le k \le 1$; and

$(u_{j+1} \ldots u_n)_b + (v_{j+1} \ldots v_n)_b = (k\,w_{j+1} \ldots w_n)_b$.”

The latter statement means more precisely that

$\sum_{j<t\le n} u_t b^{n-t} + \sum_{j<t\le n} v_t b^{n-t} = k b^{n-j} + \sum_{j<t\le n} w_t b^{n-t}$.

Before A3, we assert: “$1 \le j \le n$; $0 \le u_i, v_i < b$ for $1 \le i \le n$; $0 \le w_i < b$ for
$j \le i \le n$; $0 \le k \le 1$; and $(u_j \ldots u_n)_b + (v_j \ldots v_n)_b = (k\,w_j \ldots w_n)_b$.” After A3, we
assert that $0 \le w_i < b$ for $1 \le i \le n$; $0 \le w_0 \le 1$; and $(u_1 \ldots u_n)_b + (v_1 \ldots v_n)_b =
(w_0 \ldots w_n)_b$.

It is a simple matter to complete the proof by verifying the necessary implications 
between the assertions and by showing that the algorithm always terminates. 

5. B1. Set $j \leftarrow 1$, $w_0 \leftarrow 0$.

B2. Set $t \leftarrow u_j + v_j$, $w_j \leftarrow t \bmod b$, $i \leftarrow j$.

B3. If $t \ge b$, set $i \leftarrow i - 1$, $t \leftarrow w_i + 1$, $w_i \leftarrow t \bmod b$, and repeat this step until
$t < b$.

B4. Increase $j$ by one, and if $j \le n$ go back to B2.
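
Steps B1 through B4 translate almost line for line into Python. The sketch below assumes digit lists with the most significant digit first (so list index $j - 1$ holds $u_j$); the function name is hypothetical.

```python
def add_left_to_right(u, v, b):
    """Add two n-digit radix-b numbers from left to right, patching carries backward.

    Returns [w_0, w_1, ..., w_n].
    """
    n = len(u)
    w = [0] * (n + 1)                  # B1: w_0 <- 0
    for j in range(1, n + 1):          # B4 drives j from 1 to n
        t = u[j - 1] + v[j - 1]        # B2
        w[j] = t % b
        i = j
        while t >= b:                  # B3: propagate the carry to the left
            i -= 1
            t = w[i] + 1
            w[i] = t % b
    return w


if __name__ == "__main__":
    print(add_left_to_right([9, 9, 9], [0, 0, 1], 10))   # 999 + 001
```

The backward loop cannot run past $w_0$, because the sum of two $n$-digit numbers is less than $2b^n$ and so $w_0$ never exceeds 1.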

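6. C1. Set $j \leftarrow 1$, $i \leftarrow 0$, $r \leftarrow 0$.

C2. Set $t \leftarrow u_j + v_j$. If $t \ge b$, set $w_i \leftarrow r + 1$, $w_k \leftarrow 0$ for $i < k < j$; then set
$i \leftarrow j$ and $r \leftarrow t \bmod b$. Otherwise if $t < b - 1$, set $w_i \leftarrow r$, $w_k \leftarrow b - 1$ for
$i < k < j$; then set $i \leftarrow j$ and $r \leftarrow t$.

C3. Increase $j$ by one. If $j \le n$, go back to C2; otherwise set $w_i \leftarrow r$, and
$w_k \leftarrow b - 1$ for $i < k \le n$. |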
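
A minimal Python sketch of steps C1 through C3, under the same assumed conventions as before (digit lists with the most significant digit first; the name `add_write_once` is hypothetical). Each output digit is stored exactly once, because a run of pending digits whose sum is $b-1$ is resolved only when the next carry or non-carry is known.

```python
def add_write_once(u, v, b=10):
    """Left-to-right addition in which every w_k is stored exactly once."""
    n = len(u)
    w = [None] * (n + 1)
    i, r = 0, 0                                   # C1
    for j in range(1, n + 1):
        t = u[j - 1] + v[j - 1]                   # C2
        if t >= b:                                # a carry resolves the pending run
            w[i] = r + 1
            for k in range(i + 1, j):
                w[k] = 0
            i, r = j, t % b
        elif t < b - 1:                           # no carry can pass this position
            w[i] = r
            for k in range(i + 1, j):
                w[k] = b - 1
            i, r = j, t
        # when t == b - 1 the digit stays pending
    w[i] = r                                      # C3: flush what is left
    for k in range(i + 1, n + 1):
        w[k] = b - 1
    return w


print(add_write_once([9, 9, 9], [0, 0, 1]))
```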

7. When $j = 3$, for example, we have $k = 0$ with probability $(b+1)/2b$; $k = 1$ with
probability $((b-1)/2b)(1 - 1/b)$, namely the probability that a carry occurs and that
the preceding digit wasn’t $b - 1$; $k = 2$ with probability $((b-1)/2b)(1/b)(1 - 1/b)$;
and $k = 3$ with probability $((b-1)/2b)(1/b)(1/b)(1)$. For fixed $k$ we may add the
probabilities as $j$ varies from 1 to $n$; this gives the mean number of times the carry
propagates back $k$ places,

$$m_k = \frac{b-1}{2b^k}\left((n+1-k)\left(1 - \frac1b\right) + \frac1b\right).$$

As a check, we find that the average number of carries is

$$m_1 + 2m_2 + \cdots + nm_n = \frac12\left(n - \frac{1}{b-1}\left(1 - \left(\frac1b\right)^{n}\right)\right),$$

in agreement with (6).
8.      ENN1  N           1
        JOV   OFLO        1
        STZ   W           1
1H      LDA   U+N+1,1     N
        ADD   V+N+1,1     N
        STA   W+N+1,1     N
        JNOV  3F          N
        ENT2  -1,1        L
2H      LDA   W+N+1,2     K
        INCA  1           K
        STA   W+N+1,2     K
        DEC2  1           K
        JOV   2B          K
3H      INC1  1           N
        J1N   1B          N   |

The running time depends on $L$, the number of positions in which $u_j + v_j \ge b$; and
on $K$, the total number of carries. It is not difficult to see that $K$ is the same quantity
that appears in Program A. The analysis in the text shows that $L$ has the average
value $N((b-1)/2b)$, and $K$ has the average value $\frac12(N - b^{-1} - b^{-2} - \cdots - b^{-n})$.
So if we ignore terms of order $1/b$, the running time is $9N + L + 7K + 3 \approx 13N + 3$
cycles.

Note: Since a carry occurs almost half of the time, it would be more efficient to 
delay storing the result by one step. This leads to a somewhat longer program whose 
running time is approximately 12N + 5 cycles, based on the somewhat more detailed 
information calculated in exercise 7. 

9. Replace “$b$” by “$b_{n-j}$” everywhere in step A2.

10. If lines 06 and 07 were interchanged, we would almost always have overflow, but 
register A might have a negative value at line 08, so this would not work. If the 
instructions on lines 05 and 06 were interchanged, the sequence of overflows occurring 
in the program would be slightly different in some cases, but the program would still 
be right. 

11. (a) Set $j \leftarrow 1$; (b) if $u_j < v_j$, terminate [$u < v$]; if $u_j = v_j$ and $j = n$, terminate
[$u = v$]; if $u_j = v_j$ and $j < n$, set $j \leftarrow j + 1$ and repeat (b); if $u_j > v_j$, terminate
[$u > v$]. This algorithm tends to be quite fast, since there is usually low probability
that $j$ will have to get very high before we encounter a case with $u_j \ne v_j$.
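
The comparison procedure above amounts to a scan from the most significant digit that stops at the first difference. A small Python sketch, assuming equal-length digit lists and a hypothetical return convention of $-1$, $0$, $+1$:

```python
def compare_digits(u, v):
    """Compare two equal-length digit lists, most significant digit first.

    Returns -1 if u < v, 0 if u == v, and +1 if u > v.
    """
    for u_j, v_j in zip(u, v):
        if u_j < v_j:
            return -1
        if u_j > v_j:
            return 1
    return 0


print(compare_digits([2, 7, 1, 8], [2, 7, 4, 2]))
```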

12. Use Algorithm S with $u_j = 0$ and $v_j = w_j$. Another “borrow” will occur at the
end of the algorithm; this time it should be ignored.


13.     ENT1  N           1
        JOV   OFLO        1
        ENTX  0           1
2H      STX   CARRY       N
        LDA   U,1         N
        MUL   V           N
        SLC   5           N
        ADD   CARRY       N
        JNOV  *+2         N
        INCX  1           K
        STA   W,1         N
        DEC1  1           N
        J1P   2B          N
        STX   W           1   |

The running time is $23N + K + 5$ cycles, and $K$ is roughly $\frac12 N$.

14. The key inductive assertion is the one that should be valid at the beginning of
step M4; all others are readily filled in from this one, which is as follows: “$1 \le i \le n$;
$1 \le j \le m$; $0 \le u_r < b$ for $1 \le r \le n$; $0 \le v_r < b$ for $1 \le r \le m$; $0 \le w_r < b$ for
$j < r \le m + n$; $0 \le k < b$; and

$$(w_{j+1} \ldots w_{m+n})_b + k b^{m+n-i-j} = u \times (v_{j+1} \ldots v_m)_b + (u_{i+1} \ldots u_n)_b \times v_j b^{m-j}.\text{”}$$

(For the precise meaning of this notation, see the answer to exercise 4.)
15. The error is nonnegative and less than $(n-2)b^{-n-1}$. [Similarly, if we ignore the
products with $i + j > n + 3$, the error is bounded by $(n-3)b^{-n-2}$, etc.; but, in some
cases, we must compute all of the products if we want to get the true rounded result.]

16. S1. Set $r \leftarrow 0$, $j \leftarrow 1$.

S2. Set $w_j \leftarrow \lfloor (rb + u_j)/v \rfloor$, $r \leftarrow (rb + u_j) \bmod v$.

S3. Increase $j$ by 1, and return to S2 if $j \le n$. |
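
Steps S1 through S3 describe ordinary short division by a single digit $v$. A Python sketch under assumed conventions (digit list with the most significant digit first; the name `short_divide` is hypothetical):

```python
def short_divide(u, v, b=10):
    """Divide the radix-b number with digits u (most significant first) by 0 < v < b.

    Returns (w, r): the quotient digits w_1..w_n and the final remainder r.
    """
    r = 0                       # S1
    w = []
    for u_j in u:               # S3: j runs from 1 to n
        t = r * b + u_j         # S2
        w.append(t // v)
        r = t % v
    return w, r


print(short_divide([4, 1, 0, 0], 7))
```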

17. $u/v > u_0 b^n/((v_1 + 1)b^{n-1}) = b(1 - 1/(v_1 + 1)) > b(1 - 1/(b/2)) = b - 2$.

18. $(u_0 b + u_1)/(v_1 + 1) \le u/((v_1 + 1)b^{n-1}) < u/v$.

19. $u - \hat q v \le u - \hat q v_1 b^{n-1} - \hat q v_2 b^{n-2} = u_2 b^{n-2} + \cdots + u_n + \hat r b^{n-1} - \hat q v_2 b^{n-2} <
b^{n-2}(u_2 + 1 + \hat r b - \hat q v_2) \le 0$. Since $u - \hat q v < 0$, $q < \hat q$.

20. If $q \le \hat q - 2$, then $u < (\hat q - 1)v < \hat q(v_1 b^{n-1} + (v_2 + 1)b^{n-2}) - v < \hat q v_1 b^{n-1} +
\hat q v_2 b^{n-2} + b^{n-1} - v \le \hat q v_1 b^{n-1} + (b\hat r + u_2)b^{n-2} + b^{n-1} - v = u_0 b^n + u_1 b^{n-1} +
u_2 b^{n-2} + b^{n-1} - v \le u_0 b^n + u_1 b^{n-1} + u_2 b^{n-2} \le u$. In other words, $u < u$, and
this is a contradiction.

21. (Solution by G. K. Goyal.) The inequality $v_2 \hat q \le b\hat r + u_2$ implies that we have
$\hat q \le (u_0 b^2 + u_1 b + u_2)/(v_1 b + v_2) \le u/((v_1 b + v_2)b^{n-2})$. Now $u \bmod v = u - qv =
v(1 - \alpha)$ where $0 < \alpha = 1 + q - u/v \le \hat q - u/v \le u(1/((v_1 b + v_2)b^{n-2}) - 1/v) =
u(v_3 b^{n-3} + \cdots)/((v_1 b + v_2)b^{n-2}v) < u/(v_1 b v) \le \hat q/(v_1 b) \le (b - 1)/(v_1 b)$, and this is
at most $2/b$ since $v_1 \ge \frac12(b - 1)$.

22. Let $u = 4100$, $v = 588$. We first try $\hat q = \lfloor 41/5 \rfloor = 8$, but $8 \cdot 8 > 10(41 - 40) + 0$.
Then we set $\hat q = 7$, and now we find $7 \cdot 8 < 10(41 - 35) + 0$. But 7 times 588
equals 4116, so the true quotient is $q = 6$. (Incidentally, this example shows that
Theorem B cannot be improved under the given hypotheses, when $b = 10$.)
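
The example can be replayed with a short Python sketch of the quotient-digit test. The function below is an illustration written for this answer, not a transcription of Algorithm D: the name `estimate_qhat` and its argument order are assumptions, and it repeats the test $v_2\hat q > (u_0 b + u_1 - \hat q v_1)b + u_2$ until it fails.

```python
def estimate_qhat(u0, u1, u2, v1, v2, b=10):
    """Trial quotient digit from the three leading digits of u and two of v."""
    qhat = min((u0 * b + u1) // v1, b - 1)
    # Decrease qhat while the two-digit test says it is too large.
    while v2 * qhat > (u0 * b + u1 - qhat * v1) * b + u2:
        qhat -= 1
    return qhat


qhat = estimate_qhat(4, 1, 0, 5, 8)      # u = 4100, v = 588
print(qhat, 4100 // 588)                 # the estimate is one more than the true quotient
```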

23. Obviously $v\lfloor b/(v+1) \rfloor < (v+1)\lfloor b/(v+1) \rfloor \le (v+1)b/(v+1) = b$; also if
$v \ge \lfloor b/2 \rfloor$ we obviously have $v\lfloor b/(v+1) \rfloor \ge v \ge \lfloor b/2 \rfloor$. Finally, assume that we have
$1 \le v < \lfloor b/2 \rfloor$. Then $v\lfloor b/(v+1) \rfloor > v(b/(v+1) - 1) \ge b/2 - 1 \ge \lfloor b/2 \rfloor - 1$,
because $v(b/(v+1) - 1) - (b/2 - 1) = (b/2 - v - 1)(v - 1)/(v + 1) \ge 0$. Since
$v\lfloor b/(v+1) \rfloor > \lfloor b/2 \rfloor - 1$, we must have $v\lfloor b/(v+1) \rfloor \ge \lfloor b/2 \rfloor$.

24. The approximate probability is only $\log_b 2$, not $\frac12$. (For example, if $b = 2^{35}$, the
probability is approximately $\frac{1}{35}$; this is still high enough to warrant the special test for
$d = 1$ in steps D1 and D8.)


25. 002        ENTA  1            1
    003        ADD   V+1          1
    004        STA   TEMP         1
    005        ENTA  1            1
    006        JOV   1F           1          Jump if $v_1 = b - 1$.
    007        ENTX  0            1
    008        DIV   TEMP         1          Otherwise compute $\lfloor b/(v_1 + 1) \rfloor$.
    009        JOV   DIVBYZERO    1          Jump if $v_1 = 0$.
    010 1H     STA   D            1
    011        DECA  1            1
    012        JANZ  *+3          1          Jump if $d \ne 1$.
    013        STZ   U            1 - A
    014        JMP   D2           1 - A
    015        ENT1  N            A          Multiply $v$ by $d$.
    016        ENTX  0            A
    017 2H     STX   CARRY        AN
    018        LDA   V,1          AN
    019        MUL   D            AN
    ...                                      (as in exercise 13)
    026        J1P   2B           AN
    027        ENT1  M+N          A          (Now rX = 0.)
    028 2H     STX   CARRY        A(M + N)   Multiply $u$ by $d$.
    029        LDA   U,1          A(M + N)
    ...                                      (as in exercise 13)
    037        J1P   2B           A(M + N)
    038        STX   U            A          |

26. (See the algorithm of exercise 16.)

    101 D8     LDA   D            1     (Remainder will be left in
    102        DECA  1            1      locations U+M+1 through U+M+N)
    103        JAZ   DONE         1     Terminate if $d = 1$.
    104        ENN1  N            A     rI1 = $j - n - 1$; $j \leftarrow 1$.
    105        ENTA  0            A     $r \leftarrow 0$.
    106 1H     LDX   U+M+N+1,1    AN    rAX $\leftarrow rb + u_{m+j}$.
    107        DIV   D            AN
    108        STA   U+M+N+1,1    AN
    109        SLAX  5            AN    $r \leftarrow (rb + u_{m+j}) \bmod d$.
    110        INC1  1            AN    $j \leftarrow j + 1$.
    111        J1N   1B           AN    Repeat for $1 \le j \le n$.  |

At this point, the division routine is complete; and by the next exercise, register AX is
zero.

27. It is $du \bmod dv = d(u \bmod v)$.

28. For convenience, let us assume that $v$ has a decimal point at the left, i.e., $v =
(v_0.v_1v_2\ldots)_b$. After step N1 we have $\frac12 \le v < 1 + 1/b$: for

$$v\left\lfloor\frac{b+1}{v_1+1}\right\rfloor \le \frac{v(b+1)}{v_1+1} = \frac{v(1+1/b)}{(1/b)(v_1+1)} < 1 + \frac1b,$$

and

$$v\left\lfloor\frac{b+1}{v_1+1}\right\rfloor \ge \frac{v(b+1-v_1)}{v_1+1} \ge \frac1b\,\frac{v_1(b+1-v_1)}{v_1+1}.$$

The latter quantity takes its smallest value when $v_1 = 1$, since it is a concave function
and the other extreme value is greater.

The formula in step N2 may be written

$$v \leftarrow \left\lfloor\frac{b(b+1)}{v_1+1}\right\rfloor\frac{v}{b},$$

so we see as above that $v$ will never become $\ge 1 + 1/b$.

The minimum value of $v$ after one iteration of step N2 is $\ge$

$$\left(\frac{b(b+1)-v_1}{v_1+1}\right)\frac{v}{b} \ge \left(\frac{b(b+1)-v_1}{v_1+1}\right)\frac{v_1}{b^2} = \left(\frac{b(b+1)+1-t}{t}\right)\left(\frac{t-1}{b^2}\right) = 1 + \frac1b + \frac{2}{b^2} - \frac{1}{b^2}\left(t + \frac{b(b+1)+1}{t}\right),$$

if $t = v_1 + 1$. The minimum of this quantity occurs for $t = b/2 + 1$; a lower bound
is $1 - 3/2b$. Hence $v_1 \ge b - 2$, after one iteration of step N2. Finally, we have
$(1 - 3/2b)(1 + 1/b)^2 > 1$, when $b \ge 5$, so at most two more iterations are needed. The
assertion is easily verified when $b < 5$.

29. True, since (u 3 ... Uj-\- n )b < v. 

30. In Algorithms A and S, such overlap is possible if the algorithms are rewritten
slightly; e.g., in Algorithm A, we could rewrite step A2 thus: “Set $t \leftarrow u_j + v_j + k$,
$w_j \leftarrow t \bmod b$, $k \leftarrow \lfloor t/b \rfloor$.”

In Algorithm M, $v_j$ may be in the same location as $w_j$. In Algorithm D, it is most
convenient (as in Program D, exercise 26) to let $r_1 \ldots r_n$ be the same as $u_{m+1} \ldots u_{m+n}$;
and we can also have $q_0 \ldots q_m$ the same as $u_0 \ldots u_m$, provided that no alteration of $u_j$
is made in step D6. (Line 098 of Program D can safely be changed to “J1P 2B”, since
$u_j$ isn’t used in the subsequent calculation.)

31. Consider the situation of Fig. 6 with $u = (u_ju_{j+1} \ldots u_{j+n})_3$ as in Algorithm D.
If the leading nonzero digits of $u$ and $v$ have the same sign, set $r \leftarrow u - v$, $q \leftarrow 1$;
otherwise set $r \leftarrow u + v$, $q \leftarrow -1$. Now if $|r| > |u|$, or if $|r| = |u|$ and the first nonzero
digit of $u_{j+n+1} \ldots u_{m+n}$ has the same sign as the first nonzero digit of $r$, set $q \leftarrow 0$;
otherwise set $u_j \ldots u_{j+n}$ equal to the digits of $r$.

36. Values to 1000 decimal and 1100 octal places have been computed by R. P. Brent, 
Comp. Centre Tech. Rep. 47 (Canberra: Australian Nat. Univ., 1975). 

37. Let $d = 2^e$ so that $b > dv_1 \ge b/2$. Instead of normalizing $u$ and $v$ in step D1,
simply compute the two leading digits $v'_1v'_2$ of $2^e(v_1v_2v_3)_b$ by shifting left $e$ bits. In
step D3, use $(v'_1, v'_2)$ instead of $(v_1, v_2)$ and $(u'_j, u'_{j+1}, u'_{j+2})$ instead of $(u_j, u_{j+1}, u_{j+2})$,
where the digits $u'_ju'_{j+1}u'_{j+2}$ are obtained from $(u_ju_{j+1}u_{j+2}u_{j+3})_b$ by shifting left $e$
bits. Omit division by $d$ in step D8. (In essence, $u$ and $v$ are being “virtually” shifted.
This method saves computation when $m$ is small compared to $n$.)


SECTION 4.3.2 

1. The solution is unique since $7 \cdot 11 \cdot 13 = 1001$. The “constructive” proof of
Theorem C tells us that the answer is $((11 \cdot 13)^6 + 6 \cdot (7 \cdot 13)^{10} + 5 \cdot (7 \cdot 11)^{12}) \bmod 1001$.
But this answer is perhaps not explicit enough! By (23) we have $v_1 = 1$, $v_2 = (6 - 1) \cdot
8 \bmod 11 = 7$, $v_3 = ((5 - 1) \cdot 2 - 7) \cdot 6 \bmod 13 = 6$, so $u = 6 \cdot 7 \cdot 11 + 7 \cdot 7 + 1 = 512$.
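
The arithmetic in this answer can be checked with a short Python sketch of the mixed-radix conversion. It assumes pairwise relatively prime moduli and computes each constant $c_{ij}$ (the inverse of $m_i$ modulo $m_j$) on the fly with `pow(m_i, -1, m_j)`, which requires Python 3.8 or later; the function names are hypothetical.

```python
def mixed_radix_digits(residues, moduli):
    """Return v_1..v_r with u = v_1 + v_2*m_1 + v_3*m_1*m_2 + ... ."""
    digits = []
    for j, (u_j, m_j) in enumerate(zip(residues, moduli)):
        t = u_j
        for i in range(j):
            c_ij = pow(moduli[i], -1, m_j)       # c_ij * m_i == 1 (modulo m_j)
            t = (t - digits[i]) * c_ij % m_j
        digits.append(t % m_j)
    return digits


def from_mixed_radix(digits, moduli):
    """Evaluate v_1 + v_2*m_1 + v_3*m_1*m_2 + ... ."""
    u, weight = 0, 1
    for v_j, m_j in zip(digits, moduli):
        u += v_j * weight
        weight *= m_j
    return u


v = mixed_radix_digits((1, 6, 5), (7, 11, 13))
print(v, from_mixed_radix(v, (7, 11, 13)))
```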

2. No. There is at most one such $u$; the additional condition $u_1 \equiv \cdots \equiv u_r$
(modulo 1) is necessary and sufficient, and it follows that such a generalization is not
very interesting.
3. $u \equiv u_i$ (modulo $m_i$) implies that $u \equiv u_i$ (modulo $\gcd(m_i, m_j)$), so the condition
$u_i \equiv u_j$ (modulo $\gcd(m_i, m_j)$) must surely hold if there is a solution. Furthermore if
$u \equiv v$ (modulo $m_j$) for all $j$, then $u - v$ is a multiple of $\mathrm{lcm}(m_1, \ldots, m_r) = m$; hence
there is at most one solution.

The proof can now be completed in a nonconstructive manner by counting the
number of different $r$-tuples $(u_1, \ldots, u_r)$ satisfying the conditions $0 \le u_j < m_j$ and
$u_i \equiv u_j$ (modulo $\gcd(m_i, m_j)$). If this number is $m$, there must be a solution since
$(u \bmod m_1, \ldots, u \bmod m_r)$ takes on $m$ distinct values as $u$ goes from $a$ to $a + m$.
Assume that $u_1, \ldots, u_{r-1}$ have been chosen satisfying the given conditions; we must
now pick $u_r \equiv u_j$ (modulo $\gcd(m_j, m_r)$) for $1 \le j < r$, and by the generalized Chinese
remainder theorem for $r - 1$ elements there are

$$m_r/\mathrm{lcm}(\gcd(m_1, m_r), \ldots, \gcd(m_{r-1}, m_r)) = m_r/\gcd(\mathrm{lcm}(m_1, \ldots, m_{r-1}), m_r)
= \mathrm{lcm}(m_1, \ldots, m_r)/\mathrm{lcm}(m_1, \ldots, m_{r-1})$$

ways to do this. [This proof is based on identities (10), (11), (12), and (14) of Section
4.5.2.]

A constructive proof [A. S. Fraenkel, Proc. Amer. Math. Soc. 14 (1963), 790-791]
generalizing (24) can be given as follows. Let $M_j = \mathrm{lcm}(m_1, \ldots, m_j)$; we wish to find
$u = v_rM_{r-1} + \cdots + v_2M_1 + v_1$, where $0 \le v_j < M_j/M_{j-1}$. Assume that $v_1, \ldots,
v_{j-1}$ have already been determined; then we must solve the congruence

$$v_jM_{j-1} + v_{j-1}M_{j-2} + \cdots + v_1 \equiv u_j \pmod{m_j}.$$

Here $v_{j-1}M_{j-2} + \cdots + v_1 \equiv u_i \equiv u_j$ (modulo $\gcd(m_i, m_j)$) for $i < j$ by hypothesis,
so $c = u_j - (v_{j-1}M_{j-2} + \cdots + v_1)$ is a multiple of

$$\mathrm{lcm}(\gcd(m_1, m_j), \ldots, \gcd(m_{j-1}, m_j)) = \gcd(M_{j-1}, m_j) = d_j.$$

We therefore must solve $v_jM_{j-1} \equiv c$ (modulo $m_j$). By Euclid’s algorithm there is a
number $c_j$ such that $c_jM_{j-1} \equiv d_j$ (modulo $m_j$); hence we may take

$$v_j = (c_j c)/d_j \bmod (m_j/d_j).$$

Note that, as in the nonconstructive proof, we have $m_j/d_j = M_j/M_{j-1}$.
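
The constructive proof translates almost line for line into code. The following Python sketch is an illustration only: the name `generalized_crt` and the convention of returning `None` when the residues are inconsistent are assumptions, and the constant $c_j$ is obtained from `pow(..., -1, ...)` (Python 3.8 or later) rather than by running Euclid’s algorithm explicitly.

```python
from math import gcd


def generalized_crt(residues, moduli):
    """Solve u == u_j (modulo m_j) for moduli that need not be relatively prime.

    Returns (u, M) with 0 <= u < M = lcm(m_1, ..., m_r), or None if no solution exists.
    """
    u, M = 0, 1                        # M plays the role of M_{j-1}
    for u_j, m_j in zip(residues, moduli):
        d = gcd(M, m_j)                # d_j
        c = u_j - u                    # c = u_j - (v_{j-1} M_{j-2} + ... + v_1)
        if c % d != 0:
            return None                # the condition u_i == u_j (mod gcd) fails
        step = m_j // d                # m_j / d_j = M_j / M_{j-1}
        if step == 1:
            v_j = 0
        else:
            c_j = pow(M // d, -1, step)        # so that c_j * M == d (modulo m_j)
            v_j = c_j * (c // d) % step
        u += v_j * M
        M *= step
    return u, M


print(generalized_crt((2, 4, 6), (4, 6, 10)))
```

Because each $v_j$ is reduced modulo $M_j/M_{j-1}$, the result is automatically the representative in the range $0 \le u < \mathrm{lcm}(m_1, \ldots, m_r)$.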

4. (After $m_4 = 91 = 7 \cdot 13$, we have used up all products of two or more odd primes
that can be less than 100, so $m_5$, ... must all be prime.)

    $m_7 = 79$,    $m_8 = 73$,    $m_9 = 71$,    $m_{10} = 67$,   $m_{11} = 61$,
    $m_{12} = 59$,   $m_{13} = 53$,   $m_{14} = 47$,   $m_{15} = 43$,   $m_{16} = 41$,
    $m_{17} = 37$,   $m_{18} = 31$,   $m_{19} = 29$,   $m_{20} = 23$,   $m_{21} = 17$,

and then we are stuck ($m_{22} = 1$ does no good).

5. No. The obvious upper bound,

$$3^4\,5^2\,7^2\,11^1 \ldots = \prod_{\substack{p\ \mathrm{odd} \\ p\ \mathrm{prime}}} p^{\lfloor \log_p 100 \rfloor},$$

is attained if we choose $m_1 = 3^4$, $m_2 = 5^2$, etc. (It is more difficult, however, to
maximize $m_1 \ldots m_r$ when $r$ is fixed, or to maximize $m_1 + \cdots + m_r$ as we would attempt
to do when using moduli $2^{m_j} - 1$.)
6. (a) If $e = f + kg$, then $2^e = 2^f(2^g)^k \equiv 2^f \cdot 1^k$ (modulo $2^g - 1$). So if $2^e \equiv 2^f$
(modulo $2^g - 1$), we have $2^{e \bmod g} \equiv 2^{f \bmod g}$ (modulo $2^g - 1$); and since the latter
quantities lie between zero and $2^g - 1$ we must have $e \bmod g = f \bmod g$. (b) By part
(a), $(1 + 2^d + \cdots + 2^{(c-1)d}) \cdot (2^e - 1) \equiv (1 + 2^d + \cdots + 2^{(c-1)d}) \cdot (2^d - 1) = 2^{cd} - 1 \equiv
2^{ce} - 1 \equiv 2^1 - 1 = 1$ (modulo $2^f - 1$).

7. $(u_j - (v_1 + m_1(v_2 + m_2(v_3 + \cdots + m_{j-2}v_{j-1})\ldots)))\,c_{1j} \ldots c_{(j-1)j}$
   $\equiv (u_j - v_1)c_{1j} \ldots c_{(j-1)j} - m_1v_2c_{1j} \ldots c_{(j-1)j} - \cdots - m_1 \ldots m_{j-2}v_{j-1}c_{1j} \ldots c_{(j-1)j}$
   $\equiv (u_j - v_1)c_{1j} \ldots c_{(j-1)j} - v_2c_{2j} \ldots c_{(j-1)j} - \cdots - v_{j-1}c_{(j-1)j}$
   $\equiv (\ldots((u_j - v_1)c_{1j} - v_2)c_{2j} - \cdots - v_{j-1})c_{(j-1)j}$ (modulo $m_j$).

This method of rewriting the formulas uses the same number of arithmetic operations
and fewer constants; but the number of constants is fewer only if we order the
moduli so that $m_1 < m_2 < \cdots < m_r$, otherwise we would need a table of $m_i \bmod m_j$.
This ordering of the moduli might seem to require more computation than if we made
$m_1$ the largest, $m_2$ the next largest, etc., since there are many more operations to be
done modulo $m_r$ than modulo $m_1$; but since $v_j$ can be as large as $m_j - 1$, we are better
off with $m_1 < m_2 < \cdots < m_r$ in (23) also. So this idea appears to be preferable to
the formulas in the text, although the formulas in the text are advantageous when the
moduli have the form (14), as shown in Section 4.3.3.

8. $m_{j-1} \ldots m_1v_j \equiv m_{j-1} \ldots m_1(\ldots((u_j - v_1)c_{1j} - v_2)c_{2j} - \cdots - v_{j-1})c_{(j-1)j} \equiv
m_{j-2} \ldots m_1(\ldots(u_j - v_1)c_{1j} - \cdots - v_{j-2})c_{(j-2)j} - v_{j-1}m_{j-2} \ldots m_1 \equiv \cdots \equiv
u_j - v_1 - v_2m_1 - \cdots - v_{j-1}m_{j-2} \ldots m_1$ (modulo $m_j$).

9. $u_r \leftarrow ((\ldots(v_rm_{r-1} + v_{r-1})m_{r-2} + \cdots)m_1 + v_1) \bmod m_r$, ...,
$u_2 \leftarrow (v_2m_1 + v_1) \bmod m_2$, $u_1 \leftarrow v_1 \bmod m_1$.

(The computation should be done in this order, if we want to let $u_j$ and $v_j$ share
the same memory locations, as they can in (23).)

10. If we redefine the “mod” operator so that it produces residues in the symmetrical 
range, the basic formulas (2), (3), (4) for arithmetic and (23), (24) for conversion remain 
the same, and the number u in (24) lies in the desired range (10). (Here (24) is a balanced 
mixed-radix notation, generalizing “balanced ternary” notation.) The comparison of 
two numbers may still be done from left to right, in the simple manner described in 
the text. Furthermore, it is possible to retain the value $u_j$ in a single computer word,
if we have signed magnitude representation within the computer, even if $m_j$ is almost
twice the word size. But the arithmetic operations analogous to (11) and (12) are more 
difficult, so it appears that on most computers this idea would result in slightly slower 
operation. 

11. Multiply by

$$\frac{m+1}{2} = \left(\frac{m_1+1}{2}, \ldots, \frac{m_r+1}{2}\right).$$

Note that $2t \cdot \frac{m+1}{2} \equiv t$ (modulo $m$). In general if $v$ is relatively prime to $m$, then we can
find (by Euclid’s algorithm) a number $v' = (v'_1, \ldots, v'_r)$ such that $vv' \equiv 1$ (modulo $m$);
and then if $u$ is known to be a multiple of $v$ we have $u/v = uv'$, where the latter is
computed with modular multiplication. When $v$ is not relatively prime to $m$, division
is much more difficult.
12. Obvious from (11), if we replace $m_j$ by $m$. [Another way to test for overflow, if
$m$ is odd, is to maintain extra bits $u_0 = u \bmod 2$ and $v_0 = v \bmod 2$. Then overflow has
occurred iff $u_0 + v_0 \not\equiv w_1 + \cdots + w_r$ (modulo 2), where $(w_1, \ldots, w_r)$ are the mixed-radix
digits corresponding to $u + v$.]

13. (a) $x^2 - x = (x - 1)x \equiv 0$ (modulo $10^n$) is equivalent to $(x - 1)x \equiv 0$ (modulo $p^n$)
for $p = 2$ and 5. Either $x$ or $x - 1$ must be a multiple of $p$, and then the other is
relatively prime to $p^n$; so either $x$ or $x - 1$ must be a multiple of $p^n$. If $x \bmod 2^n =
x \bmod 5^n = 0$ or 1, we must have $x \bmod 10^n = 0$ or 1; hence automorphs have
$x \bmod 2^n \ne x \bmod 5^n$. (b) If $x = qp^n + r$, where $r = 0$ or 1, then $r = r^2 = r^3$,
so $3x^2 - 2x^3 \equiv (6qp^nr + 3r) - (6qp^nr + 2r) \equiv r$ (modulo $p^{2n}$). (c) Let $c'$ be the magic
constant $(3(cx)^2 - 2(cx)^3)/x^2 = 3c^2 - 2c^3x$.

Note: Since the last $k$ digits of an $n$-digit automorph form a $k$-digit automorph, it
makes sense to speak of the two $\infty$-digit automorphs, $x$ and $1 - x$, which are 10-adic
numbers (cf. exercise 4.1-31). The set of 10-adic numbers is equivalent under modular
arithmetic to the set of ordered pairs $(u_1, u_2)$, where $u_1$ is a 2-adic number and $u_2$ is a
5-adic number.
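
Part (b) gives a way to double the number of digits of an automorph: if $x$ is an $n$-digit automorph, then $(3x^2 - 2x^3) \bmod 10^{2n}$ is a $2n$-digit automorph. A Python sketch (the function names are hypothetical):

```python
def is_automorph(x, n):
    """True if x*x ends with the same n decimal digits as x."""
    return (x * x - x) % 10 ** n == 0


def lift_automorph(x, n):
    """Given an n-digit automorph x, return the 2n-digit automorph it extends to."""
    return (3 * x * x - 2 * x ** 3) % 10 ** (2 * n)


x, n = 5, 1
for _ in range(3):
    x, n = lift_automorph(x, n), 2 * n
    print(n, str(x).zfill(n), is_automorph(x, n))
```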
    27 × 47:      18 × 42:      09 × 05:      2718 × 4742:

     08            04            00            1269
      08            04            00             1269
     −15           +14           −45            −0045
      49            16            45             0756
       49            16            45              0756
     ----          ----          ----          --------
     1269          0756          0045          12888756
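
The table can be reproduced with a few lines of Python. The sketch assumes the three-multiplication identity $uv = (10^{2n} + 10^n)U_1V_1 + 10^n(U_1 - U_0)(V_0 - V_1) + (10^n + 1)U_0V_0$ for $u = 10^nU_1 + U_0$ and $v = 10^nV_1 + V_0$; the function name `three_mult` is hypothetical, and only one level of splitting is shown.

```python
def three_mult(u, v, n, base=10):
    """Multiply two 2n-digit numbers using three n-digit products."""
    B = base ** n
    U1, U0 = divmod(u, B)
    V1, V0 = divmod(v, B)
    high = U1 * V1                       # 27 x 47 in the example
    low = U0 * V0                        # 18 x 42
    middle = (U1 - U0) * (V0 - V1)       # 09 x 05 with sign: (27 - 18)(42 - 47)
    return (B * B + B) * high + B * middle + (B + 1) * low


print(three_mult(27, 47, 1), three_mult(18, 42, 1), three_mult(2718, 4742, 2))
```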


2. $\sqrt{Q + \lfloor\sqrt Q\rfloor} \le \sqrt{Q + \sqrt Q} < \sqrt{Q + 2\sqrt Q + 1} = \sqrt Q + 1$, so $\lfloor\sqrt{Q + R}\rfloor \le
\lfloor\sqrt Q\rfloor + 1$.

3. When $k \le 2$, the result is true, so assume that $k > 2$. Let $q_k = 2^{Q_k}$, $r_k = 2^{R_k}$, so
that $R_k = \lfloor\sqrt{Q_k}\rfloor$ and $Q_k = Q_{k-1} + R_{k-1}$. We must show that $1 + (R_k + 1)2^{R_k} \le
2^{Q_{k-1}}$; this inequality isn’t close at all, and one way to prove it is to observe that $1 + (R_k + 1)2^{R_k} \le
1 + 2^{2R_k}$ and $2R_k < Q_{k-1}$ when $k > 2$. (The fact that $2R_k < Q_{k-1}$ is readily proved
by induction since $R_{k+1} - R_k \le 1$ and $Q_k - Q_{k-1} \ge 2$.)

4. For $j = 1, \ldots, r$, calculate $U_e(j^2)$, $jU_o(j^2)$, $V_e(j^2)$, $jV_o(j^2)$; and by recursively
calling the multiplication algorithm, calculate

$$W(j) = (U_e(j^2) + jU_o(j^2))(V_e(j^2) + jV_o(j^2)),$$

$$W(-j) = (U_e(j^2) - jU_o(j^2))(V_e(j^2) - jV_o(j^2)).$$

Then we have $W_e(j^2) = \frac12(W(j) + W(-j))$, $jW_o(j^2) = \frac12(W(j) - W(-j))$. Also
calculate $W_e(0) = U(0)V(0)$. Now construct difference tables for $W_e$ and $W_o$, which are
polynomials whose respective degrees are $r$ and $r - 1$.

This method reduces the size of the numbers being handled, and reduces the 
number of additions and multiplications. Its only disadvantage is a longer program 
(since the control is somewhat more complex, and some of the calculations must be 
done with signed numbers). 





Another possibility would perhaps be to evaluate $W_e$ and $W_o$ at $1^2, 2^2, 4^2, \ldots,
(2^r)^2$; although the numbers involved are larger, the calculations are faster, since all
multiplications are replaced by shifting and all divisions are by binary numbers of the
form $2^j(2^k - 1)$. (Simple procedures are available for dividing by such numbers.)

5. Start the $q$ and $r$ sequences out with $q_0$ and $q_1$ large enough so that the inequality
in exercise 3 is valid. Then we will find in the formulas like those preceding Theorem C
that we have $\eta_1 \to 0$ and $\eta_2 = (1 + 1/(2r_k))2^{1+\sqrt{2Q_k}-\sqrt{2Q_{k+1}}}(Q_k/Q_{k+1})$. The factor
$Q_k/Q_{k+1} \to 1$ as $k \to \infty$, so we can ignore it if we want to show that $\eta_2 < 1 - \epsilon$ for all
large $k$. Now $\sqrt{2Q_{k+1}} = \sqrt{2Q_k + 2\lceil\sqrt{2Q_k}\rceil + 2} \ge \sqrt{(2Q_k + 2\sqrt{2Q_k} + 1) + 1} \ge
\sqrt{2Q_k} + 1 + 1/(3R_k)$. Hence $\eta_2 \le (1 + 1/(2r_k))2^{-1/(3R_k)}$, and $\lg \eta_2 < 0$ for large
enough $k$.

Note: Algorithm C can also be modified to define a sequence $q_0, q_1, \ldots$ of a similar
type that is based on $n$, so that $n \approx q_k + q_{k+1}$ after step C1. This modification leads
to the estimate (19).

6. Any common divisor of $6q + d_1$ and $6q + d_2$ must also divide their difference
$d_2 - d_1$. The $\binom62$ differences are 2, 3, 4, 6, 8, 1, 2, 4, 6, 1, 3, 5, 2, 4, 2, so we must only
show that at most one of the given numbers is divisible by each of the primes 2, 3, 5.
Clearly only $6q + 2$ is even, and only $6q + 3$ is a multiple of 3; and there is at most
one multiple of 5, since $q_k \not\equiv 3$ (modulo 5).
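
A brute-force check of this argument in Python. The six offsets $-1, 1, 2, 3, 5, 7$ are an assumption: they are inferred here from the list of fifteen differences, not quoted from the exercise statement.

```python
from itertools import combinations
from math import gcd

OFFSETS = (-1, 1, 2, 3, 5, 7)   # assumed offsets d, inferred from the differences


def moduli(q):
    """The six numbers 6q + d for the assumed offsets."""
    return [6 * q + d for d in OFFSETS]


def pairwise_coprime(ms):
    return all(gcd(a, b) == 1 for a, b in combinations(ms, 2))


# The differences, in the order listed in the answer.
print([b - a for a, b in combinations(OFFSETS, 2)])

# Pairwise relatively prime exactly when q is not congruent to 3 modulo 5.
print(all(pairwise_coprime(moduli(q)) == (q % 5 != 3) for q in range(1, 1000)))
```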

7. Let $p_{k-1} < n \le p_k$. We have $t_k \le 6t_{k-1} + ck3^k$ for some constant $c$; so $t_k/6^k \le
t_{k-1}/6^{k-1} + ck/2^k \le t_0 + c\sum_{j\ge1}(j/2^j) = M$. Thus $t_k \le M \cdot 6^k = O(n^{\log_3 6})$.

8. Let $2^k$ be the smallest power of 2 that exceeds $2K$. Set $a_t \leftarrow \omega^{-t^2/2}u_t$ and
$b_t \leftarrow \omega^{(2K-2-t)^2/2}$, where $u_t = 0$ for $t \ge K$. We want to calculate the convolutions
$c_r = \sum_{0\le j\le r} a_jb_{r-j}$ for $r = 2K - 2 - s$, when $0 \le s < K$. The convolutions
can be found by using three fast Fourier transformations of order $2^k$, as in the text’s
multiplication procedure. [Note that this device works for any complex number $\omega$, not
necessarily a root of unity. See L. I. Bluestein, Northeast Electronics Res. and Eng.
Meeting Record 10 (1968), 218-219.]

9 . u s = U{qs) mod k- In particular, if q = —1 we get «(- r )modff, which avoids 
shuffling when computing inverse transforms. 

10 . A [j] (s fc _i,... , s k —j, tk—j— i,..., to) can be written 



(«0 — 1 ) 2 * ('t fc 

Lu 


— 1 i ■ ft fc - j 5 ^ 1 


;0...0)s 


( u tp u P \( y, 


and this is $\sum_{p,q} u_pv_qS(p, q)$, where $S(p, q) = 0$ or $2^j$. We have $S(p, q) = 2^j$ for exactly
$2^{2k}/2^j$ values of $p$ and $q$.

11. An automaton cannot have $z_2 = 1$ until it has $c \ge 2$, and this occurs first for
$M_j$ at time $3j - 1$. It follows that $M_j$ cannot have $z_2z_1z_0 \ne 000$ until time $3(j - 1)$.
Furthermore, if $M_j$ has $z_0 \ne 0$ at time $t$, we cannot change this to $z_0 = 0$ without
affecting the output; but the output cannot be affected by this value of $z_0$ until at least
time $t + j - 1$, so we must have $t + j - 1 \le 2n$. Since the first argument we gave
proves that $3(j - 1) \le t$, we must have $4(j - 1) \le 2n$, that is, $j - 1 \le n/2$, i.e.,
$j \le \lfloor n/2 \rfloor + 1$. This is the best possible bound, since the inputs $u = v = 2^n - 1$
require the use of $M_j$ for all $j \le \lfloor n/2 \rfloor + 1$. (For example, note from Table 1 that $M_2$
is needed to multiply two-bit numbers, at time 3.)





12. We can “sweep through” $K$ lists of MIX-like instructions, executing the first
instruction on each list, in $O(K + (N \log N)^2)$ steps as follows: (1) A radix list sort
(Section 5.2.5) will group together all identical instructions, in time $O(K + N)$. (2)
Each set of $j$ identical instructions can be performed in $O(\log N)^2 + O(j)$ steps, and
there are $O(N^2)$ sets. A bounded number of sweeps will finish all the lists. The remaining
details are straightforward; for example, arithmetic operations can be simulated by
converting $p$ and $q$ to binary. [SIAM J. Computing, to appear.]

13. If it takes $T(n)$ steps to multiply $n$-bit numbers, we can accomplish $m$-bit times
$n$-bit multiplication by breaking the $n$-bit number into $\lceil n/m \rceil$ $m$-bit groups, using
$\lceil n/m \rceil T(m) + O(n + m)$ operations. The results of this section therefore give an
estimated running time of $O(n \log m \log\log m)$ on Turing machines, or $O(n \log m)$ on
machines with random access to words of bounded size, or $O(n)$ on pointer machines.


SECTION 4.4 


1. We compute $(\ldots((a_mb_{m-1} + a_{m-1})b_{m-2} + a_{m-2}) \ldots + a_1)b_0 + a_0$ by adding and
multiplying in the $B_j$ system.

                    T.  = 20(cwt.  = 8(st.  = 14(lb.  = 16 oz.)))
    Start with zero   0         0        0         0        0
    Add 3             0         0        0         0        3
    Multiply by 24    0         0        0         4        8
    Add 9             0         0        0         5        1
    Multiply by 60    0         2        5         9       12
    Add 12            0         2        5        10        8
    Multiply by 60    8         3        1         0        0
    Add 37            8         3        1         2        5

(Addition and multiplication by a constant in a mixed-radix system are readily done
using a simple generalization of the usual carry rule; cf. exercise 4.3.1-9.)

2. We compute $\lfloor u/B_0 \rfloor$, $\lfloor\lfloor u/B_0 \rfloor/B_1 \rfloor$, etc., and the remainders are $A_0$, $A_1$, etc. The
division is done in the $b_j$ system.

                    d.  = 24(h.  = 60(m.  = 60 s.))
    Start with u      3        9       12       37
    Divide by 16      0        5        4       32      Remainder = 5
    Divide by 14      0        0       21       45      Remainder = 2
    Divide by 8       0        0        2       43      Remainder = 1
    Divide by 20      0        0        0        8      Remainder = 3
    Divide by ∞       0        0        0        0      Remainder = 8

    Answer: 8 T. 3 cwt. 1 st. 2 lb. 5 oz.
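
Both tables describe the same conversion, once by Horner-style evaluation and once by repeated division. The following Python sketch checks them against each other; it uses ordinary integers for the intermediate value instead of carrying out the arithmetic inside either mixed-radix system, and the function names are hypothetical.

```python
def mixed_to_int(digits, radices):
    """Evaluate a mixed-radix number; radices[i] is the radix of digits[i + 1]."""
    value = digits[0]
    for digit, radix in zip(digits[1:], radices):
        value = value * radix + digit
    return value


def int_to_mixed(value, radices):
    """Inverse conversion by repeated division, least significant radix first."""
    digits = []
    for radix in reversed(radices):
        value, remainder = divmod(value, radix)
        digits.append(remainder)
    digits.append(value)             # the leading position has no upper bound
    return digits[::-1]


TIME = [24, 60, 60]                  # d. = 24(h. = 60(m. = 60 s.))
WEIGHT = [20, 8, 14, 16]             # T. = 20(cwt. = 8(st. = 14(lb. = 16 oz.)))

u = mixed_to_int([3, 9, 12, 37], TIME)
print(u, int_to_mixed(u, WEIGHT))
```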





3. The following procedure due to G. L. Steele Jr. and Jon L. White generalizes
Taranto’s algorithm for $B = 2$ originally published in CACM 2, 7 (July 1959), 27.

A1. [Initialize.] Set $M \leftarrow 0$, $U_0 \leftarrow 0$.

A2. [Done?] If $u < \epsilon$ or $u > 1 - \epsilon$, go to step A4. (Otherwise no $M$-place fraction
will satisfy the given conditions.)

A3. [Transform.] Set $M \leftarrow M + 1$, $U_{-M} \leftarrow \lfloor Bu \rfloor$, $u \leftarrow Bu \bmod 1$, $\epsilon \leftarrow B\epsilon$, and return
to A2. (This transformation returns us to essentially the same state we were in
before; the remaining problem is to convert $u$ to $U$ with fewest radix-$B$ places so
that $|U - u| < \epsilon$. Note, however, that $\epsilon$ may now be $\ge 1$; in this case we could
go immediately to step A4 instead of storing the new value of $\epsilon$.)

A4. [Round.] If $u \ge \frac12$, increase $U_{-M}$ by 1. (If $u = \frac12$ exactly, another rounding rule
such as “increase $U_{-M}$ by 1 only when it is odd” might be preferred.) |

Step A4 will never increase $U_{-M}$ from $B - 1$ to $B$; for if $U_{-M} = B - 1$ we must have
$M > 0$, but no $(M - 1)$-place fraction was sufficiently accurate. Steele and White go
on to consider floating-point conversions in their paper [to appear].
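
Steps A1 through A4 can be tried out with exact rational arithmetic. The sketch below is an illustration under stated assumptions: the inputs are `fractions.Fraction` values with $0 \le u < 1$ and $\epsilon > 0$, the function name is hypothetical, and the plain “round up when $u \ge \frac12$” rule of step A4 is used.

```python
from fractions import Fraction
from math import floor


def shortest_fraction_digits(u, eps, B=10):
    """Return [U_0, U_-1, ..., U_-M], the shortest radix-B fraction within eps of u."""
    digits = [0]                                  # A1: M = 0, U_0 = 0
    while not (u < eps or u > 1 - eps):           # A2
        t = B * u                                 # A3
        d = floor(t)
        digits.append(d)
        u = t - d
        eps = B * eps
    if u >= Fraction(1, 2):                       # A4
        digits[-1] += 1
    return digits


print(shortest_fraction_digits(Fraction(1, 3), Fraction(1, 100)))
print(shortest_fraction_digits(Fraction(2, 3), Fraction(1, 20)))
```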

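4. (a) $1/2^k = 5^k/10^k$. (b) Every prime divisor of $b$ divides $B$.

5. Iff $10^n - 1 \le c < w$, cf. (3).

7. $\alpha u \le ux \le \alpha u + u/w < \alpha u + 1$, hence $\lfloor \alpha u \rfloor \le \lfloor ux \rfloor \le \lfloor \alpha u + 1 \rfloor$. Furthermore,
in the special case cited we have $ux < \alpha u + \alpha$ and $\lfloor \alpha u \rfloor = \lfloor \alpha u + \alpha - \epsilon \rfloor$.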

8.      ENT1  0
        LDA   U
1H      MUL   =1//10=
3H      STA   TEMP
        MUL   =-10=
        SLAX  5
        ADD   U
        JANN  2F
        LDA   TEMP         (Can occur only on
        DECA  1             the first iteration,
        JMP   3B            by exercise 7.)
2H      STA   ANSWER,1     (May be minus zero.)
        LDA   TEMP
        INC1  1
        JAP   1B           |

9. If $x'$ is an integer, $x - \epsilon \le x' \le x$, then $(1 + 1/n)x - ((1 + 1/n)\epsilon + 1 - 1/n) \le
x' + \lfloor x'/n \rfloor \le (1 + 1/n)x$. Hence if $\alpha$ is the binary fraction satisfying

$$1/10 - 2^{-35} < \alpha = (.000110011001100110011001100110011)_2 < 1/10,$$

we find that $\alpha u - \epsilon \le v \le \alpha u$ at the end of the computation, where

e = |+ (.100010001010100011001000101010001)2 < §. 

Hence $u/10 - 2 < u/10 - (\epsilon + (1/10 - \alpha)u) \le v \le \alpha u < u/10$. Since $v$ is an integer,
the proof is complete.

10. (a) Shift right one; (b) Extract left bit of each group; (c) Shift result of (b) right 
two; (d) Shift result of (c) right one, and add to result of (c); (e) Subtract result of 
(d) from result of (a). 
11.   5.7 7 2 1
    − 1 0
      4 7.7 2 1
    −   9 4
      3 8 3.2 1
    −   7 6 6
      3 0 6 6.1
    −   6 1 3 2
      2 4 5 2 9      Answer: $(24529)_{10}$.

12. First convert the ternary number to nonary (radix 9) notation, then proceed as in 
octal-to-decimal conversion but without doubling. Decimal to nonary is similar. In the 
given example, we have 


      1.7 6 4 7 2 3
    −   1
      1 6.6 4 7 2 3
    −   1 6
      1 5 0.4 7 2 3
    −   1 5 0
      1 3 5 4.7 2 3
    −   1 3 5 4
      1 2 1 9 3.2 3
    −   1 2 1 9 3
      1 0 9 7 3 9.3
    −   1 0 9 7 3 9
        9 8 7 6 5 4      Answer: $(987654)_{10}$.


        9.8 7 6 5 4
    +     9
      1 1 8.7 6 5 4
    +   1 1 8
      1 3 1 6.6 5 4
    +   1 3 1 6
      1 4 4 8 3.5 4
    +   1 4 4 8 3
      1 6 0 4 2 8.4
    +   1 6 0 4 2 8
      1 7 6 4 7 2 3      Answer: $(1764723)_9$.


13. BUF    ALF   .                      (Radix point on first line)
           ORIG  *+39
    START  JOV   OFLO                   Ensure overflow is off.
           ENT2  -40                    Set buffer pointer.
    8H     ENT3  10                     Set loop counter.
    1H     ENT1  m                      Begin multiplication routine.
           ENTX  0
    2H     STX   CARRY
           ...                          (See exercise 4.3.1-13, with
           J1P   2B                      $v = 10^9$ and W = U.)
           SLAX  5                      rA $\leftarrow$ next nine digits.
           CHAR
           STA   BUF+40,2(2:5)          Store next nine digits.
           STX   BUF+41,2
           INC2  2                      Increase buffer pointer.
           DEC3  1
           J3P   1B                     Repeat ten times.
           OUT   BUF+20,2(PRINTER)
           J2N   8B                     Repeat until both lines printed.  |
14. Let $K(n)$ be the number of steps required to convert an $n$-digit decimal number
to binary and at the same time to compute the binary representation of $10^n$. Then we
have $K(2n) \le 2K(n) + O(M(n))$. Proof: Given the number $U = (u_{2n-1} \ldots u_0)_{10}$,
compute $U_1 = (u_{2n-1} \ldots u_n)_{10}$ and $U_0 = (u_{n-1} \ldots u_0)_{10}$ and $10^n$, in $2K(n)$ steps,
then compute $U = 10^nU_1 + U_0$ and $10^{2n} = 10^n \cdot 10^n$ in $O(M(n))$ steps. It follows that
$K(2^n) = O(M(2^n) + 2M(2^{n-1}) + 4M(2^{n-2}) + \cdots) = O(nM(2^n))$.

[Similarly, Schönhage has observed that we can convert a $(2^n \lg 10)$-bit number $U$
from binary to decimal, in $O(nM(2^n))$ steps. First form $V = 10^{2^{n-1}}$ in $O(M(2^{n-1}) +
M(2^{n-2}) + \cdots) = O(M(2^n))$ steps, then compute $U_0 = (U \bmod V)$ and $U_1 = \lfloor U/V \rfloor$
in $O(M(2^n))$ further steps, then convert $U_0$ and $U_1$.]
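
The recurrence $K(2n) \le 2K(n) + O(M(n))$ corresponds to a simple recursive procedure. In the Python sketch below (the name `decimal_to_int` is hypothetical), each call returns both the value and the matching power of ten, so that every level does one multiplication to combine the halves and one to form the new power. The split also handles lengths that are not powers of 2, which the answer does not need to consider.

```python
def decimal_to_int(digits):
    """Convert a nonempty decimal string to (value, 10**len(digits)) by halving."""
    n = len(digits)
    if n == 1:
        return int(digits), 10
    half = n // 2
    high, pow_high = decimal_to_int(digits[:n - half])   # U_1 and its power of ten
    low, pow_low = decimal_to_int(digits[n - half:])     # U_0 and 10**half
    return high * pow_low + low, pow_high * pow_low


print(decimal_to_int("987654"))
```

Python’s built-in integers hide the cost of the underlying multiplications, so the sketch shows only the structure of the recursion, not the running time.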

18. Let $U = \mathrm{round}_B(u, P)$ and $v = \mathrm{round}_b(U, p)$. We may assume that $u > 0$, so
that $U \ne 0$ and $v \ne 0$. Case 1, $v < u$: Determine $e$ and $E$ such that $b^{e-1} < u \le b^e$,
$B^{E-1} \le U < B^E$. Then $u \le U + \frac12B^{E-P}$ and $U \le u - \frac12b^{e-p}$; hence $B^{P-1} \le
B^{P-E}U < B^{P-E}u \le b^{p-e}u \le b^p$. Case 2, $v > u$: Determine $e$ and $E$ such that
$b^{e-1} \le u < b^e$, $B^{E-1} < U \le B^E$. Then $u \ge U - \frac12B^{E-P}$ and $U \ge u + \frac12b^{e-p}$;
hence $B^{P-1} \le B^{P-E}(U - B^{E-P}) < B^{P-E}u \le b^{p-e}u < b^p$. Thus we have proved
that $B^{P-1} < b^p$ whenever $v \ne u$.

Conversely, if $B^{P-1} < b^p$, the above proof suggests that the most likely example
for which $u \ne v$ will occur when $u$ is a power of $b$ and at the same time it is close to a
power of $B$. We have $B^{P-1}b^p < B^{P-1}b^p + \frac12b^p - \frac12B^{P-1} - \frac14 = (B^{P-1} + \frac12)(b^p - \frac12)$;
hence $1 < \alpha = 1/(1 - \frac12b^{-p}) < 1 + \frac12B^{1-P} = \beta$. There are integers $e$ and $E$ such
that $\log_B \alpha < e\log_B b - E < \log_B \beta$, since Weyl’s theorem (exercise 3.5-22) implies
that there is an integer $e$ with $0 < \log_B \alpha < (e\log_B b) \bmod 1 < \log_B \beta < 1$ when $\log_B b$
is irrational. Hence $\alpha < b^e/B^E < \beta$, for some $e$ and $E$. (Such $e$ and $E$ may also be
found by applying the theory of continued fractions, see Section 4.5.3.) Now we have
$\mathrm{round}_B(b^e, P) = B^E$, and $\mathrm{round}_b(B^E, p) < b^e$. [CACM 11 (1968), 47-50; Proc. Amer.
Math. Soc. 19 (1968), 716-723.]

19. $m_1 = (\mathrm{F0F0F0F0})_{16}$, $c_1 = 1 - 10/16$ makes $U = ((u_7u_6)_{10} \ldots (u_1u_0)_{10})_{256}$; then
$m_2 = (\mathrm{FF00FF00})_{16}$, $c_2 = 1 - 10^2/16^2$ makes $U = ((u_7u_6u_5u_4)_{10}(u_3u_2u_1u_0)_{10})_{65536}$;
and $m_3 = (\mathrm{FFFF0000})_{16}$, $c_3 = 1 - 10^4/16^4$ finishes the job. (Cf. exercise 14. This
technique is due to Roy A. Keir, circa 1958.)
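
A Python sketch of the three masking steps, assuming that each step has the form $U \leftarrow U - c_i(U \mathbin{\&} m_i)$ and that $U$ starts as eight binary-coded decimal digits packed into 32 bits. Multiplying by $c_1 = 6/16$ is written as a right shift by 4 followed by multiplication by 6, and similarly for the other two constants; the function name is hypothetical.

```python
def bcd_to_binary(u):
    """Convert eight packed BCD digits (one per 4-bit nibble) to a binary integer."""
    u -= 6 * ((u & 0xF0F0F0F0) >> 4)        # c_1 = 1 - 10/16: bytes hold 2 decimal digits
    u -= 156 * ((u & 0xFF00FF00) >> 8)      # c_2 = 1 - 100/256: halfwords hold 4 digits
    u -= 55536 * ((u & 0xFFFF0000) >> 16)   # c_3 = 1 - 10000/65536: full 8 digits
    return u


print(bcd_to_binary(0x12345678))
```

No borrow can cross a field boundary, because the amount subtracted from each field never exceeds the value the field already holds.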


SECTION 4.5.1 

1. Test whether or not uv' < u'v, since the denominators are positive. 

2. If c > 1 divides both u/d and v/d, then cd divides both u and v. 

3. Let $p$ be prime. If $p^e$ is a divisor of $uv$ and $u'v'$ for $e \ge 1$, then either $p^e \backslash u$ and
$p^e \backslash v'$ or $p^e \backslash u'$ and $p^e \backslash v$; hence $p^e \backslash \gcd(u, v')\gcd(u', v)$. The converse follows by
reversing the argument.

4. Let $d_1 = \gcd(u, v)$, $d_2 = \gcd(u', v')$; the answer is $w = (u/d_1)(v'/d_2)\,\mathrm{sign}(v)$,
$w' = |(u'/d_2)(v/d_1)|$, with a “divide by zero” error message if $v = 0$.
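
The division rule can be sketched in Python as follows. The example assumes that both operands are already in lowest terms with positive denominators; the function name `divide_fractions` is hypothetical, and `ZeroDivisionError` stands in for the “divide by zero” error message.

```python
from math import gcd


def divide_fractions(u, u_prime, v, v_prime):
    """Return (w, w_prime) = (u/u_prime) / (v/v_prime) in lowest terms."""
    if v == 0:
        raise ZeroDivisionError("divide by zero")
    d1 = gcd(u, v)
    d2 = gcd(u_prime, v_prime)
    sign_v = 1 if v > 0 else -1
    w = (u // d1) * (v_prime // d2) * sign_v
    w_prime = abs((u_prime // d2) * (v // d1))
    return w, w_prime


print(divide_fractions(3, 4, -9, 8))     # (3/4) / (-9/8)
```

Removing $d_1$ and $d_2$ before multiplying keeps the intermediate products as small as possible, and no further gcd is needed afterward.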

5. di = 10, t = 17 ■ 7 — 27 • 12 = —205, d 2 = 5, w = —41, w' — 168. 

6. Let $u'' = u'/d_1$, $v'' = v'/d_1$; our goal is to show that $\gcd(uv'' + u''v, d_1) =
\gcd(uv'' + u''v, d_1u''v'')$. If $p$ is a prime that divides $u''$, then $p$ does not divide $u$ or $v''$,
so $p$ does not divide $uv'' + u''v$. A similar argument holds for prime divisors of $v''$, so
no prime divisors of $u''v''$ affect the given gcd.

7. $(N - 1)^2 + (N - 2)^2 = 2N^2 - (6N - 5)$. If the inputs are $n$-bit binary numbers,
$2n + 1$ bits may be necessary to represent $t$.

8. For multiplication and division these quantities obey the rules $x/0 = \mathrm{sign}(x)\infty$,
$(\pm\infty) \times x = x \times (\pm\infty) = (\pm\infty)/x = \pm\mathrm{sign}(x)\infty$, $x/(\pm\infty) = 0$, provided that $x$
is finite and nonzero, without change to the algorithms described. Furthermore, the
algorithms can readily be modified so that $0/0 = 0 \times (\pm\infty) = (\pm\infty) \times 0 =$ “(0/0)”,
where the latter is a representation of “undefined”; and so that if either operand is
“undefined” the result will be “undefined” also.

Since the multiplication and division subroutines can yield these fairly natural
rules of “extended arithmetic,” it is sometimes worthwhile to modify the addition and
subtraction operations so that they satisfy the rules $x \pm \infty = \pm\infty$, $x \pm (-\infty) = \mp\infty$,
for $x$ finite; $(\pm\infty) + (\pm\infty) = \pm\infty - (\mp\infty) = \pm\infty$; furthermore $(\pm\infty) + (\mp\infty) =
(\pm\infty) - (\pm\infty) = (0/0)$; and if either or both operands are (0/0), the result should
also be (0/0). Equality tests and comparisons may be treated in a similar manner.

The above remarks are independent of “overflow” indications. If $\infty$ is being used
to suggest overflow, it is incorrect to let $1/\infty$ be equal to zero, lest inaccurate results
be regarded as true answers. It is far better to represent overflow by (0/0), and to
adhere to the convention that the result of any operation is undefined if at least one of 
the inputs is undefined. This type of overflow indication has the advantage that final 
results of an extended calculation reveal exactly which answers are defined and which 
are not. 

9. If $u/u' \ne v/v'$, then

$$1 \le |uv' - u'v| = u'v'\,|(u/u') - (v/v')| < |2^{2n}(u/u') - 2^{2n}(v/v')|;$$

two quantities differing by more than unity cannot have the same “floor.” (In other
words, the first $2n$ bits to the right of the binary point are enough to characterize the
value of the fraction, when there are $n$-bit denominators. We cannot improve this to
$2n - 1$ bits, for if $n = 4$ we have $\frac{1}{13} = (.00010011\ldots)_2$, $\frac{1}{14} = (.00010010\ldots)_2$.)
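The claim of answer 9 is easy to check by brute force for small n. The sketch below is only an illustration under stated assumptions: it takes fractions in [0, 1) with denominators below 2^n, and the helper name `floor_collisions` is hypothetical.

```python
from fractions import Fraction

def floor_collisions(n, bits):
    """Pairs of distinct fractions num/den (den < 2**n, 0 <= num < den)
    whose first `bits` binary places after the point coincide."""
    seen = {}
    collisions = []
    for den in range(1, 2**n):
        for num in range(den):
            f = Fraction(num, den)
            key = (f.numerator << bits) // f.denominator  # floor(2**bits * f)
            if key in seen and seen[key] != f:
                collisions.append((seen[key], f))
            seen.setdefault(key, f)
    return collisions
```

With n = 4, eight bits separate every pair, while seven bits confuse 1/13 with 1/14, in agreement with the binary expansions quoted in the answer.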

11. To divide by $(v + v'\sqrt5)/v''$, when $v$ and $v'$ are not both zero, multiply by the
reciprocal, $(v - v'\sqrt5)v''/(v^2 - 5v'^2)$, and reduce to lowest terms.

12. One idea is to limit numerator and denominator to a total of 27 bits, where we need 
only store 26 of these bits (since the leading bit of the denominator can be assumed 1). 
This leaves room for a sign and five bits to indicate the denominator size. Another 
idea is to use 28 bits for numerator and denominator, which are to have a total of at 
most seven hexadecimal digits, together with a sign and a 3-bit field to indicate the 
number of hexadecimal digits in the denominator. 

[Using the formulas in the next exercise, the first alternative leads to exactly 
2140040119 finite representable numbers, while the second leads to 1830986459. The 
first alternative is preferable because it represents more values, and because it is 
‘cleaner’ and makes smoother transitions between ranges.] 

13. The number of multiples of $n$ in the interval $(a, b]$ is $\lfloor b/n \rfloor - \lfloor a/n \rfloor$. Hence, by
inclusion and exclusion, the answer to this problem is $S_0 - S_1 + S_2 - \cdots$, where $S_k$
is $\sum (\lfloor M_2/P \rfloor - \lfloor M_1/P \rfloor)(\lfloor N_2/P \rfloor - \lfloor N_1/P \rfloor)$, summed over all products $P$ of $k$ distinct primes.


SECTION 4.5.2 

1. Substitute min, max, $+$ consistently for gcd, lcm, $\times$, respectively.

2. For prime $p$, let $u_p, v_{1p}, \ldots, v_{np}$ be the exponents of $p$ in the canonical factorizations
of $u, v_1, \ldots, v_n$. By hypothesis, $u_p \le v_{1p} + \cdots + v_{np}$. We must show that
$u_p \le \min(u_p, v_{1p}) + \cdots + \min(u_p, v_{np})$, and this is certainly true if $u_p$ is greater than
or equal to each $v_{jp}$, or if $u_p$ is less than some $v_{jp}$.

3. Solution 1: A one-to-one correspondence is obtained if we set $u = \gcd(d, n)$ and
$v = n^2/\mathrm{lcm}(d, n)$ for each divisor $d$ of $n^2$. Solution 2: If $n = p_1^{e_1} \ldots p_r^{e_r}$, the number
in each case is $(2e_1 + 1) \ldots (2e_r + 1)$.

4. See exercise 3.2.1.2-15(a). 

5. Shift $u$ and $v$ right until neither is a multiple of 3, remembering the proper power
of 3 that will appear in the gcd. Each subsequent iteration sets $t \leftarrow u + v$ or $t \leftarrow u - v$
(whichever is a multiple of 3), shifts $t$ right until it is not a multiple of 3, then replaces
$\max(u, v)$ by the result.

        u        v        t
      13634    24140    10506, 3502;
      13634     3502    17136, 5712, 1904;
       1904     3502    5406, 1802;
       1904     1802    102, 34;
         34     1802    1836, 612, 204, 68;
         34       68    102, 34;
         34       34    0.



The evidence that gcd(40902,24140) = 34 is now overwhelming. 
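The ternary procedure of answer 5 can be sketched directly; here “shift right” is taken to mean division by 3, and the function name `ternary_gcd` is hypothetical. The sketch follows the steps in the answer rather than any tuned implementation.

```python
from math import gcd

def ternary_gcd(u: int, v: int) -> int:
    """gcd of positive integers using only +, -, and division by 3."""
    power = 1
    while u % 3 == 0 and v % 3 == 0:      # common powers of 3 go into the gcd
        u //= 3
        v //= 3
        power *= 3
    while u % 3 == 0:
        u //= 3
    while v % 3 == 0:
        v //= 3
    while True:
        # Exactly one of u+v, u-v is a multiple of 3 when 3 divides neither.
        t = u + v if (u + v) % 3 == 0 else abs(u - v)
        if t == 0:
            return u * power
        while t % 3 == 0:
            t //= 3
        if u >= v:                         # replace max(u, v) by the result
            u = t
        else:
            v = t

# The example of the table: 40902 = 3 * 13634.
assert ternary_gcd(40902, 24140) == gcd(40902, 24140) == 34
```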

6. The probability that both $u$ and $v$ are even is $\frac14$; the probability that both are
multiples of four is $\frac1{16}$; etc. Thus $A$ has the distribution given by the generating
function

$$\frac34 + \frac3{16}z + \frac3{64}z^2 + \cdots = \frac{3/4}{1 - z/4}.$$

The mean is $\frac13$, and the standard deviation is $\sqrt{\frac29 + \frac13 - \frac19} = \frac23$. If $u$ and $v$ are independently
and uniformly distributed with $1 \le u, v < 2^N$, then some small correction
terms are needed; the mean is then actually

$$(2^N - 1)^{-2} \sum_{1 \le k \le N} (2^{N-k} - 1)^2 = \tfrac13 - \tfrac43 (2^N - 1)^{-1} + N(2^N - 1)^{-2}.$$
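The corrected mean in answer 6 can be confirmed by exhaustive enumeration for small N. This is only a check of the closed form, using exact rational arithmetic; the function names are hypothetical.

```python
from fractions import Fraction

def mean_common_twos(N: int) -> Fraction:
    """Exact mean of A, the number of common factors of 2, over 1 <= u, v < 2**N."""
    total = 0
    for u in range(1, 2**N):
        for v in range(1, 2**N):
            a = 0
            while (u >> a) % 2 == 0 and (v >> a) % 2 == 0:
                a += 1
            total += a
    return Fraction(total, (2**N - 1) ** 2)

def closed_form(N: int) -> Fraction:
    """1/3 - (4/3)/(2^N - 1) + N/(2^N - 1)^2."""
    M = 2**N - 1
    return Fraction(1, 3) - Fraction(4, 3) / M + Fraction(N, M * M)
```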

7. When $u$, $v$ are not both even, each of the cases (even, odd), (odd, even), (odd,
odd) is equally probable, and $B = 1, 0, 0$ in these cases. Hence $B = \frac13$ on the average.
Actually, as in exercise 6, a small correction could be given to be strictly accurate when
$1 \le u, v < 2^N$; the probability that $B = 1$ is actually

$$(2^N - 1)^{-2} \sum_{1 \le k \le N} (2^{N-k} - 1)\,2^{N-k} = \tfrac13 - \tfrac13 (2^N - 1)^{-1}.$$

8. The quantity $E$ is the number of subtraction cycles in which $u > v$, plus one if $u$
is odd after step B1. If we change the inputs from $(u, v)$ to $(v, u)$, the value of $C$ stays
unchanged, while $E$ becomes $C - E$ or $C - E - 1$; the latter case occurs iff $u$ and $v$
are both odd after step B1, and this has probability $\frac13 + \frac23/(2^N - 1)$. Hence

$$E_{\mathrm{ave}} = C_{\mathrm{ave}} - E_{\mathrm{ave}} - \tfrac13 - \tfrac23/(2^N - 1).$$

9. The binary algorithm first gets to B6 with $u = 1963$, $v = 1359$; then $t \leftarrow 604$, 302,
151, etc. The gcd is 302. Using Algorithm X we find that $2 \cdot 31408 - 23 \cdot 2718 = 302$.
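The multipliers 2 and -23 in answer 9 come out of the usual extended Euclidean iteration on triples. The sketch below follows the shape of Algorithm X as it is commonly stated (vectors (u1, u2, u3) and (v1, v2, v3) with u·u1 + v·u2 = u3); the function name is hypothetical.

```python
def extended_gcd(u: int, v: int):
    """Return (u1, u2, u3) with u*u1 + v*u2 == u3 == gcd(u, v)."""
    u1, u2, u3 = 1, 0, u
    v1, v2, v3 = 0, 1, v
    while v3 != 0:
        q = u3 // v3
        # (t1, t2, t3) <- (u1, u2, u3) - q*(v1, v2, v3); then rotate.
        u1, u2, u3, v1, v2, v3 = v1, v2, v3, u1 - q * v1, u2 - q * v2, u3 - q * v3
    return u1, u2, u3
```

Applied to 31408 and 2718 it returns the triple (2, -23, 302).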

10. (a) Two integers are relatively prime iff they are not both divisible by any prime
number. (b) Rearrangement of the sum in (a), in terms of the denominators $k =
p_1 \ldots p_r$. (Note that each of the sums in (a) and (b) is actually finite.) (c) Since
$(n/k)^2 - \lfloor n/k \rfloor^2 = O(n/k)$, we have $q_n - \sum_{1 \le k \le n} \mu(k)(n/k)^2 = \sum_{1 \le k \le n} O(n/k) =
O(nH_n)$. Furthermore $\sum_{k > n} (n/k)^2 = O(n)$. (d) $\sum_{d \backslash n} \mu(d) = \delta_{1n}$. [In fact, we have
the more general result

$$\sum_{d \backslash n} \mu(d)\,(n/d)^s = n^s - \sum (n/p)^s + \sum (n/pq)^s - \cdots,$$

as in part (b), where the sums on the right are over the prime divisors of $n$, and this is
equal to $n^s(1 - 1/p_1^s) \ldots (1 - 1/p_r^s)$ if $n = p_1^{e_1} \ldots p_r^{e_r}$.]

Notes: Similarly, we find that a set of $k$ integers is relatively prime with probability
$1/(\sum_{n \ge 1} 1/n^k)$. This proof of Theorem D is due to F. Mertens, J. für die reine
und angew. Math. 77 (1874), 289-291. The technique actually gives a much stronger
result, namely that $6\pi^{-2}mn + O(n \log m)$ pairs of integers $u \in [f(m), f(m) + m)$,
$v \in [g(n), g(n) + n)$ are relatively prime, for arbitrary $f$ and $g$, when $m \le n$.
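The density $6/\pi^2$ behind this answer shows up quickly in a direct count. The sketch below simply counts coprime pairs in a square; it is a numerical illustration, not part of Mertens's proof, and the function name is hypothetical.

```python
from math import gcd, pi

def coprime_fraction(n: int) -> float:
    """Fraction of pairs (u, v), 1 <= u, v <= n, with gcd(u, v) = 1."""
    count = sum(1 for u in range(1, n + 1)
                  for v in range(1, n + 1) if gcd(u, v) == 1)
    return count / n**2

# The difference from 6/pi^2 = 0.6079... shrinks roughly like (log n)/n.
print(coprime_fraction(100), 6 / pi**2)
```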

11. (a) $6/\pi^2$ times $1 + \frac14 + \frac19$, namely $49/(6\pi^2) \approx .82746$. (b) $6/\pi^2$ times $1/1 + 2/4 +
3/9 + \cdots$, namely $\infty$. (This is true in spite of the result of exercise 12, and in spite
of the fact that the average value of $\ln \gcd(u, v)$ is a small, finite number.)

12. Let $\sigma(n)$ be the number of positive divisors of $n$. The answer is

$$\sum_{k \ge 1} \sigma(k) \cdot \frac{6}{\pi^2 k^2} = \frac{6}{\pi^2} \Bigl( \sum_{k \ge 1} \frac{1}{k^2} \Bigr)^2 = \frac{\pi^2}{6}.$$

[Thus, the average is less than 2, although there are always at least two common divisors
when $u$ and $v$ are not relatively prime.]

13. $1 + \frac19 + \frac1{25} + \cdots = 1 + \frac14 + \frac19 + \cdots - \frac14\bigl(1 + \frac14 + \frac19 + \cdots\bigr)$.

14. $v_1 = \pm v/u_3$, $v_2 = \mp u/u_3$ (the sign depends on whether the number of iterations
is even or odd). This follows from the fact that $v_1$ and $v_2$ are relatively prime to each
other (throughout the algorithm), and that $v_1u = -v_2v$. [Hence $v_1u = \mathrm{lcm}(u, v)$ at
the close of the algorithm, but this is not an especially efficient way to compute the
least common multiple. For a generalization, see exercise 4.6.1-18.]

G. E. Collins has observed that $|u_1| \le \frac12 v/u_3$, $|u_2| \le \frac12 u/u_3$, at the termination
of Algorithm X, except in certain trivial cases, since the final value of $q$ is usually $\ge 2$.
This bounds the size of $|u_1|$ and $|u_2|$ throughout the execution of the algorithm.

15. Apply Algorithm X to $v$ and $m$, thus obtaining a value $x$ such that $xv \equiv 1$
(modulo $m$). (This can be done by simplifying Algorithm X so that $u_2$, $v_2$, and $t_2$ are
not computed, since they are never used in the answer.) Then set $w \leftarrow ux \bmod m$.
[It follows, as in exercise 30, that this process requires $O(n^2)$ units of time, when it is
applied to large $n$-bit numbers. An alternative to Algorithm X appears in exercise 35.]
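The simplification described in answer 15 (dropping the second coordinate of each triple) can be sketched as follows. The function name `mod_divide` is hypothetical, and raising an error when v has no inverse is a choice made here, not something stated in the answer.

```python
def mod_divide(u: int, v: int, m: int) -> int:
    """Return w with w*v ≡ u (mod m), assuming gcd(v, m) = 1."""
    # Invariant: u1*v ≡ u3 and v1*v ≡ v3 (mod m); second coordinates omitted.
    u1, u3 = 1, v
    v1, v3 = 0, m
    while v3 != 0:
        q = u3 // v3
        u1, u3, v1, v3 = v1, v3, u1 - q * v1, u3 - q * v3
    if u3 != 1:
        raise ValueError("v is not invertible modulo m")
    return (u * u1) % m
```

For example, dividing 3 by 7 modulo 10 gives 9, since 9·7 = 63 ≡ 3.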

16. (a) Set $t_1 = x + 2y + 3z$; then $3t_1 + y + 2z = 1$, $5t_1 - 3y - 20z = 3$. Eliminate $y$,
then $14t_1 - 14z = 6$: No solution. (b) This time $14t_1 - 14z = 0$. Divide by 14,
eliminate $t_1$; the general solution is $x = 8z - 2$, $y = 1 - 5z$, $z$ arbitrary.

17. Let $u_1, u_2, u_3, v_1, v_2, v_3$ be multiprecision variables, in addition to $u$ and $v$.
The extended algorithm will act the same on $u_3$ and $v_3$ as Algorithm L does on $u$
and $v$. New multiprecision operations are to set $t \leftarrow Au_j$, $t \leftarrow t + Bv_j$, $w \leftarrow Cu_j$,
$w \leftarrow w + Dv_j$, $u_j \leftarrow t$, $v_j \leftarrow w$ for all $j$, in step L4; also if $B = 0$ in that step to set
$t \leftarrow u_j - qv_j$, $u_j \leftarrow v_j$, $v_j \leftarrow t$ for all $j$ and for $q = \lfloor u_3/v_3 \rfloor$. A similar modification is
made to step L1 if $v_3$ is small. The inner loop (steps L2 and L3) is unchanged.

18. If $mn = 0$, the probabilities of the lattice-point model in the text are exact, so we
may assume that $m \ge n > 0$. Valida vi, the following values have been obtained:

Case 1, $m = n$. From $(n, n)$ we go to $(n - t, n)$ with probability $t/2^t - 5/2^{t+1} +
3/2^{2t}$, for $2 \le t < n$. (These values are $\frac1{16}, \frac7{64}, \frac{27}{256}, \ldots$.) To $(0, n)$ the probability
is $n/2^{n-1} - 1/2^{n-2} + 1/2^{2n-2}$. To $(n, k)$ the probability is the same as to $(k, n)$. The
algorithm terminates with probability $1/2^{n-1}$.

Case 2, $m = n + 1$. From $(n + 1, n)$ we get to $(n, n)$ with probability $\frac18$ when $n > 1$,
or 0 when $n = 1$; to $(n - t, n)$ with probability $11/2^{t+3} - 3/2^{2t+1}$, for $1 \le t < n - 1$.
(These values are $\frac5{16}, \frac14, \ldots$.) We get to $(1, n)$ with probability $5/2^{n+1} - 3/2^{2n-1}$,
for $n > 1$; to $(0, n)$ with probability $3/2^n - 1/2^{2n-1}$.

Case 3, $m \ge n + 2$. The probabilities are given by the following table:

    (m - 1, n):        1/2 - 3/2^(m-n+2) - delta_(n1)/2^(m+1);
    (m - t, n):        1/2^t + 3/2^(m-n+t+1),    1 < t < n;
    (m - n, n):        1/2^n + 1/2^m,            n > 1;
    (m - n - 1, n):    1/2^(n+1) + 1/2^(m-1);
    (m - n - t, n):    1/2^(n+t),                1 < t < m - n;
    (0, n):            1/2^(m-1).

Note: Although these exact probabilities will certainly improve on the lattice-
point model considered in the text, they lead to recurrence relations of much greater
complexity; and they will not provide the true behavior of Algorithm B, since for
example the probability that $\gcd(u, v) = 5$ is different from the probability that
$\gcd(u, v) = 7$.
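The Case 3 table can be checked by enumeration. The sketch below assumes the model behind this answer: u and v are odd, uniformly distributed with floor(lg u) = m and floor(lg v) = n, and one subtract-and-shift cycle replaces u by the odd part of u - v. All function names are hypothetical.

```python
from fractions import Fraction

def odd_numbers(e):
    """Odd integers x with floor(lg x) == e."""
    return [x for x in range(2**e, 2**(e + 1)) if x % 2 == 1]

def observed(m, n):
    """Exact distribution of floor(lg u') after one cycle, for m > n."""
    counts = {}
    us, vs = odd_numbers(m), odd_numbers(n)
    for u in us:
        for v in vs:
            t = u - v                     # positive and even, since m > n
            while t % 2 == 0:
                t //= 2
            k = t.bit_length() - 1
            counts[k] = counts.get(k, 0) + 1
    total = len(us) * len(vs)
    return {k: Fraction(c, total) for k, c in counts.items()}

def case3_table(m, n):
    """Probabilities from the table, valid for m >= n + 2 and n >= 1."""
    half = Fraction(1, 2)
    p = {m - 1: half - 3 * half**(m - n + 2) - (half**(m + 1) if n == 1 else 0)}
    for t in range(2, n):
        p[m - t] = half**t + 3 * half**(m - n + t + 1)
    if n > 1:
        p[m - n] = half**n + half**m
    p[m - n - 1] = half**(n + 1) + half**(m - 1)
    for t in range(2, m - n):
        p[m - n - t] = half**(n + t)
    p[0] = half**(m - 1)
    return p
```

For (m, n) = (5, 3), for instance, both functions give the distribution 5/16, 11/32, 5/32, 1/8, 1/16 over the first coordinates 4, 3, 2, 1, 0.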

19. A n + 1 = a + J2l<k<n 2 fc A(n+l)(n-fc) + 2 n b — a+Y2l<k<n2 fc An(n-fc) + 
^c(l — 2 ~ n ) -(- 2 ~ n b = a -\- ^A n ( n _i) %(A n — a) -j- ^c(l — 2~ n ); now substitute 
for An{n—i) from (36). 

20. The paths described in the hint have the same probability, but the subsequent 
termination of the algorithm has a different probability; thus X = A:+l with probability 
2~ k times the probability that X = 1. Let the latter probability be p. We know from 
the text that X = 0 with probability « §; hence f ^p(l-t-^ + 4 + '") :=: 2p. The 

average is p(l + \ + f + f H-) = p(l + h + \ + g H- f = 4 p. [The exact 

probability that X = 1 is g — §(—\) n if m > n > 1, | — ^(—|) n if m — n > 2.] 

21. Show that for fixed $v$ and for $2^m < u < 2^{m+1}$, when $m$ is large, each subtraction-
shift cycle of the algorithm reduces $\lfloor \lg u \rfloor$ by two, on the average.

22. Exactly $(N - m)2^{m-1+\delta_{m0}}$ integers $u$ in the range $1 \le u < 2^N$ have $\lfloor \lg u \rfloor = m$,
after $u$ has been shifted right until it is odd.

23. The first sum is $2^{2N-2} \sum_{0 \le m < n < N} mn\,2^{-m-n}\bigl((\alpha + \beta)N + \gamma - \alpha m - \beta n\bigr)$. Since

$$\sum_{0 \le m < n} m2^{-m} = 2 - (n + 1)2^{1-n}$$

and

$$\sum_{0 \le m < n} m(m - 1)2^{-m} = 4 - (n^2 + n + 2)2^{1-n},$$

the sum on $m$ is

$$2^{2N-2} \sum_{0 \le n < N} n2^{-n}\Bigl(\bigl(\gamma - \alpha + (\alpha + \beta)N - \beta n\bigr)\bigl(2 - (n + 1)2^{1-n}\bigr) - \alpha\bigl(4 - (n^2 + n + 2)2^{1-n}\bigr)\Bigr)$$

$$= 2^{2N-2}\Bigl((\alpha + \beta)N \sum_{0 \le n < N} n2^{-n}\bigl(2 - (n + 1)2^{1-n}\bigr) + O(1)\Bigr).$$

Thus the coefficient of $(\alpha + \beta)N$ in the answer is found to be $2^{-2}\bigl(4 - (\frac43)^3\bigr) = \frac{11}{27}$. A
similar argument applies to the other sum.

Note: The exact value of the sums may be obtained after some tedious calculation
by means of the general formula

$$\sum_{0 \le k < n} k^{\underline{m}} z^k = \frac{m!\,z^m}{(1 - z)^{m+1}} - \sum_{0 \le k \le m} \frac{m^{\underline{k}}\, n^{\underline{m-k}}\, z^{n+k}}{(1 - z)^{k+1}},$$

which follows from summation by parts.

24. Solving a recurrence similar to (34), we find that the number of times is A mn , 

where A$o = 1, Ao n — (n -f- 3)/2, A nn = f — (3?i -j- 13)/(9 • 2 n ) —\) n if tl > 1, 

A mn = | — 2/(3 • 2 n ) -f- y|(—|) n if m > n > 1. Since the condition u = 1 or = 1 
is therefore satisfied only about 1.6 times in an average run, it is not worth making 
the suggested test each time step B5 is performed. (Of course the lattice model is not 
completely accurate, but it seems reasonable to believe that it is not too inaccurate for 
this application.) 

25. (a) F n+ i(x) = £ d >i 2~ d probability that (X„ < 1 and 2 d /(X~ 1 — 1) < x or 
X n > land(X„-l)/2‘ i < i) = Ed>i 2- < ‘(K(l/(l+2 d i-‘))+l!’„(l+2 ,i i)-R l (l)). 
(b) G„+i(i) = 1 - Ed>, 2~ d (G n ( 1/(1 + 2 d x)) - G„(l/(1 + 2 d x~ 1 ))). (c) H n (x) = 
^ ~2 d>1 2 —d probability that ( Y n < x and (1 — Y n )/2 d < x) can be transformed into 
J2d>i 2 _d max(0, G„(x) — G„(l — 2‘*x)). 

Starting with $G_0(x) = x$ we get rapid convergence to a limiting distribution where


(G(.l),..., G(.9)) = (.2750, .4346, .5544, .6507, .7310, .7995, .8590, .9114, .9581). 


The expected value of $\ln(\max(u_n, v_n)/\max(u_{n+1}, v_{n+1}))$ is $\int_0^1 H_n(t)\,dt/t$, and Brent
has shown that this can be written



t 



G n {t ) 

1 — t 


dt -f 



l/(l + 2 fc ) 
/(l + 2 fc+l) 


G n (t) 
t( 1 - t) 


dt. 
26. By induction, the length is $m + \lfloor n/2 \rfloor$ when $m \ge n$, except that when $m = n = 1$
there is no path to (0,0).

27. Let $a_n = (2^n - (-1)^n)/3$; then $a_0, a_1, a_2, \ldots = 0, 1, 1, 3, 5, 11, 21, \ldots$.
(This sequence of numbers has an interesting pattern of zeros and ones in its binary
representation. Note that $a_n = a_{n-1} + 2a_{n-2}$, and $a_n + a_{n+1} = 2^n$.) For $m > n$,
let $u = 2^{m+1} - a_{n+2}$, $v = a_{n+2}$. For $m = n > 0$, let $u = a_{n+2}$ and $v = 2a_{n+1}$,
or $u = 2a_{n+1}$ and $v = a_{n+2}$ (depending on which is larger). Another example for the
case $m = n > 0$ is $u = 2^{n+1} - 1$, $v = 2^{n+1} - 2$; this choice takes more shifts, and
gives $C = n + 1$, $D = 2n$, $E = 1$.
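The extremal inputs of answer 27 can be tried against a plain count of subtraction steps. In the sketch below, `subtraction_steps` is a hypothetical helper that mirrors the subtract-and-shift structure of the binary gcd algorithm (both operands are first made odd, and every subtraction, including the final one that yields zero, is counted); it is not a transcription of Algorithm B step by step.

```python
def jacobsthal(n: int) -> int:
    """a_n = (2^n - (-1)^n)/3: 0, 1, 1, 3, 5, 11, 21, ..."""
    return (2**n - (-1)**n) // 3

def subtraction_steps(u: int, v: int) -> int:
    """Number of subtractions u - v performed by a binary gcd computation."""
    while u % 2 == 0 and v % 2 == 0:
        u //= 2
        v //= 2
    while u % 2 == 0:
        u //= 2
    while v % 2 == 0:
        v //= 2
    steps = 0
    while True:
        steps += 1
        t = u - v
        if t == 0:
            return steps
        a = abs(t)
        while a % 2 == 0:
            a //= 2
        if t > 0:
            u = a                          # the larger operand is replaced
        else:
            v = a

def worst_case(m: int, n: int):
    """Inputs for m > n with floor(lg u) = m and floor(lg v) = n."""
    return 2**(m + 1) - jacobsthal(n + 2), jacobsthal(n + 2)
```

For example, `worst_case(4, 2)` is (27, 5), and the count for that pair is 5 = m + 1.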

28. This is a problem where it appears to be necessary to prove more than was asked
just to prove what was asked. Let us prove the following: If $u$ and $v$ are positive integers,
Algorithm B does $\le 1 + \lfloor \lg \max(u, v) \rfloor$ subtraction steps; and if equality holds, then
$\lfloor \lg(u + v) \rfloor > \lfloor \lg \max(u, v) \rfloor$.

For convenience, let us assume that $u \ge v$; let $m = \lfloor \lg u \rfloor$, $n = \lfloor \lg v \rfloor$; and let us
use the “lattice-point” terminology, saying that we are “at point $(m, n)$.” The proof is
by induction on $m + n$.

Case 1, $m = n$. Clearly, $\lfloor \lg(u + v) \rfloor > \lfloor \lg u \rfloor$ in this case. If $u = v$ the result is
trivial; otherwise the next subtraction-shift cycle takes us to a point $(m - k, m)$. By
induction, at most $m + 1$ further subtraction steps will be required; but if $m + 1$ more
are needed, we have $\lfloor \lg((u - v)2^{-r} + v) \rfloor > \lfloor \lg v \rfloor$, where $r \ge 1$ is the number of right
shifts that were made. This is impossible, since $(u - v)2^{-r} + v < (u - v) + v = u$.
So at most $m$ further steps are needed.

Case 2, $m > n$. The next subtraction step takes us to $(m - k, n)$, and at most
$1 + \max(m - k, n) \le m$ further steps will be required. Now if $m$ further steps are
required, then $u$ has been replaced by $u' = (u - v)2^{-r}$ for some $r \ge 1$. By induction,
$\lfloor \lg(u' + v) \rfloor \ge m$; hence

$$\lfloor \lg(u + v) \rfloor = \lfloor \lg 2((u - v)/2 + v) \rfloor \ge \lfloor \lg 2(u' + v) \rfloor \ge m + 1 > \lfloor \lg u \rfloor.$$

29. Subtract the $k$th column from the $2k$th, $3k$th, $4k$th, etc., for $k = 1, 2, 3, \ldots$. The
result is a triangular matrix with $x_k$ on the diagonal in column $k$, where $m = \sum_{d \backslash m} x_d$.
It follows that $x_m = \varphi(m)$, so the determinant is $\varphi(1)\varphi(2) \ldots \varphi(n)$.

[In general, “Smith’s determinant,” in which the $(i, j)$ element is $f(\gcd(i, j))$ for an
arbitrary function $f$, is equal to $\prod_{1 \le m \le n} \sum_{d \backslash m} \mu(m/d) f(d)$, by the same argument.
See L. E. Dickson, History of the Theory of Numbers 1 (New York: Chelsea, 1952),
122-123.]
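The determinant identity of answer 29 is easy to confirm numerically for small n. The sketch uses exact Gaussian elimination over the rationals; the helper names `det` and `phi` are hypothetical, and `phi` is a naive totient count.

```python
from fractions import Fraction
from math import gcd, prod

def det(rows):
    """Determinant by Gaussian elimination with exact rational arithmetic."""
    a = [[Fraction(x) for x in r] for r in rows]
    n = len(a)
    d = Fraction(1)
    for c in range(n):
        pivot = next((r for r in range(c, n) if a[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d

def phi(m: int) -> int:
    """Euler's totient, counted directly."""
    return sum(1 for k in range(1, m + 1) if gcd(k, m) == 1)

def gcd_matrix(n: int):
    return [[gcd(i, j) for j in range(1, n + 1)] for i in range(1, n + 1)]
```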

30. To determine $A$ and $r$ such that $u = Av + r$, $0 \le r < v$, using ordinary long
division, takes $O((1 + \log A)(\log u))$ units of time. If the quotients during the algorithm
are $A_1, A_2, \ldots, A_m$, then $A_1A_2 \ldots A_m \le u$, so $\log A_1 + \cdots + \log A_m \le \log u$. Also
$m = O(\log u)$.

31. In general, since $(a^u - 1) \bmod (a^v - 1) = a^{u \bmod v} - 1$ (cf. Eq. 4.3.2-19), we find
that $\gcd(a^m - 1, a^n - 1) = a^{\gcd(m,n)} - 1$ for all positive integers $a$.
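Both identities in answer 31 can be spot-checked with exact integer arithmetic. This is a numerical check only; the function name is hypothetical.

```python
from math import gcd

def check_power_gcd(a: int, m: int, n: int) -> bool:
    """True if gcd(a^m - 1, a^n - 1) == a^gcd(m, n) - 1."""
    return gcd(a**m - 1, a**n - 1) == a**gcd(m, n) - 1

# The remainder identity behind it, for v >= 1 and a >= 2:
assert (2**17 - 1) % (2**5 - 1) == 2**(17 % 5) - 1
```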

32. Yes, to $O(n(\log n)^2(\log\log n))$, even if we also need to compute the sequence of
partial quotients that would be computed by Euclid’s algorithm; see A. Schönhage,
Acta Informatica 1 (1971), 139-144. [But Algorithm L is better in practice unless $n$ is
extremely large.]

34. Keep track of the most significant and least significant words of the operands (the
most significant is used to guess the sign of $t$ and the least significant is to determine the
amount of right shift), while building a $2 \times 2$ matrix $A$ of single-precision integers such
that $A\binom{u}{v} = \binom{u'w}{v'w}$, where $w$ is the computer word size and where $u'$ and $v'$ are smaller
than $u$ and $v$. (Instead of dividing the simulated odd operand by 2, multiply the other
one by 2, until obtaining multiples of $w$ after exactly $\lg w$ shifts.) Experiments show
this algorithm running four times as fast as Algorithm L, on at least one computer.

35. (Solution by Michael Penk.) 

Y1. [Find power of 2.] Same as step B1.

Y2. [Initialize.] Set $(u_1, u_2, u_3) \leftarrow (1, 0, u)$ and $(v_1, v_2, v_3) \leftarrow (v, 1 - u, v)$. If $u$ is odd,
set $(t_1, t_2, t_3) \leftarrow (0, -1, -v)$ and go to Y4. Otherwise set $(t_1, t_2, t_3) \leftarrow (1, 0, u)$.

Y3. [Halve $t_3$.] If $t_1$ and $t_2$ are both even, set $(t_1, t_2, t_3) \leftarrow (t_1, t_2, t_3)/2$; otherwise set
$(t_1, t_2, t_3) \leftarrow (t_1 + v, t_2 - u, t_3)/2$. (In the latter case, $t_1 + v$ and $t_2 - u$ will both
be even.)

Y4. [Is $t_3$ even?] If $t_3$ is even, go back to Y3.

Y5. [Reset $\max(u_3, v_3)$.] If $t_3$ is positive, set $(u_1, u_2, u_3) \leftarrow (t_1, t_2, t_3)$; otherwise set
$(v_1, v_2, v_3) \leftarrow (v - t_1, -u - t_2, -t_3)$.

Y6. [Subtract.] Set $(t_1, t_2, t_3) \leftarrow (u_1, u_2, u_3) - (v_1, v_2, v_3)$. Then if $t_1 \le 0$, set $(t_1, t_2) \leftarrow
(t_1 + v, t_2 - u)$. If $t_3 \ne 0$, go back to Y3. Otherwise the algorithm terminates
with $(u_1, u_2, u_3 \cdot 2^k)$ as the output. |

It is clear that the relations in (16) are preserved, and that $0 \le u_1, v_1, t_1 \le v$,
$0 \ge u_2, v_2, t_2 \ge -u$ after each of steps Y2-Y6. If $u$ is odd after step Y2, then step
Y3 can be simplified, since $t_1$ and $t_2$ are both even iff $t_2$ is even; similarly, if $v$ is odd,
then $t_1$ and $t_2$ are both even iff $t_1$ is even. Thus, as in Algorithm X, it is possible to
suppress all calculations involving $u_2$, $v_2$, and $t_2$, provided that $v$ is odd after step Y2.
This condition is often known in advance (e.g., when $v$ is prime and we are trying to
compute $u^{-1}$ modulo $v$).
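Steps Y1-Y6 translate into a short loop. The sketch below is one reading of those steps, under these assumptions: the inputs are positive integers, the test in Y6 is taken as t1 ≤ 0, and the jump after a nonzero subtraction goes back to the halving step. The function name `binary_extended_gcd` is hypothetical.

```python
def binary_extended_gcd(u: int, v: int):
    """Return (a, b, g) with a*u + b*v == g == gcd(u, v), using only
    addition, subtraction, and halving."""
    u0, v0 = u, v
    k = 0
    while u % 2 == 0 and v % 2 == 0:       # Y1: remove common factors of 2
        u //= 2
        v //= 2
        k += 1
    u1, u2, u3 = 1, 0, u                    # Y2
    v1, v2, v3 = v, 1 - u, v
    if u % 2 == 1:
        t1, t2, t3 = 0, -1, -v
    else:
        t1, t2, t3 = 1, 0, u
    while True:
        while t3 % 2 == 0:                  # Y3/Y4: halve while t3 is even
            if t1 % 2 == 0 and t2 % 2 == 0:
                t1, t2, t3 = t1 // 2, t2 // 2, t3 // 2
            else:
                t1, t2, t3 = (t1 + v) // 2, (t2 - u) // 2, t3 // 2
        if t3 > 0:                          # Y5: reset max(u3, v3)
            u1, u2, u3 = t1, t2, t3
        else:
            v1, v2, v3 = v - t1, -u - t2, -t3
        t1, t2, t3 = u1 - v1, u2 - v2, u3 - v3   # Y6: subtract
        if t1 <= 0:
            t1, t2 = t1 + v, t2 - u
        if t3 == 0:
            g = u3 << k
            assert u1 * u0 + u2 * v0 == g   # the relation being maintained
            return u1, u2, g
```

With u = 5 and v = 3 the loop ends with (2, -3, 1), that is, 2·5 - 3·3 = 1.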


SECTION 4.5.3 

1. The running time is about $19.02T + 6$, just a trifle slower than Program 4.5.2A.

2. $$\begin{pmatrix} Q_n(x_1, x_2, \ldots, x_{n-1}, x_n) & Q_{n-1}(x_1, x_2, \ldots, x_{n-1}) \\ Q_{n-1}(x_2, \ldots, x_{n-1}, x_n) & Q_{n-2}(x_2, \ldots, x_{n-1}) \end{pmatrix}.$$
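The matrix of continuants in answer 2 can be compared with an explicit matrix product. The sketch assumes that the exercise concerns the product of the 2 × 2 matrices with rows (x_k, 1) and (1, 0), and it uses the recurrence Q_n(x_1, ..., x_n) = x_n Q_{n-1} + Q_{n-2}; the function names are hypothetical.

```python
def continuant(xs):
    """Q_n(x_1, ..., x_n) via Q_n = x_n * Q_{n-1} + Q_{n-2}, with Q_0 = 1."""
    p, q = 1, 0
    for x in xs:
        p, q = x * p + q, p
    return p

def matrix_product(xs):
    """Product of the matrices [[x, 1], [1, 0]] for x in xs, left to right."""
    m = [[1, 0], [0, 1]]
    for x in xs:
        m = [[m[0][0] * x + m[0][1], m[0][0]],
             [m[1][0] * x + m[1][1], m[1][0]]]
    return m

xs = [2, 3, 4, 5]
expected = [[continuant(xs), continuant(xs[:-1])],
            [continuant(xs[1:]), continuant(xs[1:-1])]]
assert matrix_product(xs) == expected == [[157, 30], [68, 13]]
```

Taking determinants of both sides gives the identity used in answer 4, since each factor has determinant -1.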

3. $Q_n(x_1, \ldots, x_n)$.

4. By induction, or by taking the determinant of the matrix product in exercise 2.

5. When the $x$’s are positive, the $q$’s of (9) are positive, and $q_{n+1} > q_{n-1}$; hence
(9) is an alternating series of decreasing terms, and it converges iff $q_nq_{n+1} \to \infty$. By
induction, if the $x$’s are greater than $\epsilon$, we have $q_n \ge c(1 + \epsilon/2)^n$, where $c$ is chosen
small enough to make this inequality valid for $n = 1$ and 2. But if $x_n = 1/2^n$, we have
$q_n \le 2 - 1/2^n$.

6. It suffices to prove that $A_1 = B_1$, and from the fact that $0 \le /\!/x_1, \ldots, x_n/\!/ < 1$
whenever $x_1, \ldots, x_n$ are positive integers, we have $B_1 = \lfloor 1/X \rfloor = A_1$.

7. Only $1\ 2 \ldots n$ and $n \ldots 2\ 1$. (The variable $x_k$ appears in exactly $F_kF_{n+1-k}$ terms;
hence $x_1$ and $x_n$ can only be permuted into $x_1$ and $x_n$. If $x_1$ and $x_n$ are fixed by the
permutation, it follows by induction that $x_2, \ldots, x_{n-1}$ are also fixed.)

8. This is equivalent to

$$\frac{Q_{n-2}(A_{n-1}, \ldots, A_2) - XQ_{n-1}(A_{n-1}, \ldots, A_1)}{Q_{n-1}(A_n, \ldots, A_2) - XQ_n(A_n, \ldots, A_1)} = -\frac{1}{X_n},$$

and by (6) this is equivalent to

$$X = \frac{Q_{n-1}(A_2, \ldots, A_n) + X_nQ_{n-2}(A_2, \ldots, A_{n-1})}{Q_n(A_1, \ldots, A_n) + X_nQ_{n-1}(A_1, \ldots, A_{n-1})}.$$

9. (a) By definition. (b), (d) Prove this when $n = 1$, then apply (a) to get the result
for general $n$. (c) Prove when $n = k + 1$, then apply (a).

10. If $A_0 > 0$, then $B_0 = 0$, $B_1 = A_0$, $B_2 = A_1$, $B_3 = A_2$, $B_4 = A_3$, $B_5 = A_4$,
$m = 5$. If $A_0 = 0$, then $B_0 = A_1$, $B_1 = A_2$, $B_2 = A_3$, $B_3 = A_4$, $m = 3$. If $A_0 = -1$
and $A_1 = 1$, then $B_0 = -(A_2 + 2)$, $B_1 = 1$, $B_2 = A_3 - 1$, $B_3 = A_4$, $m = 3$. If
$A_0 = -1$ and $A_1 > 1$, then $B_0 = -2$, $B_1 = 1$, $B_2 = A_1 - 2$, $B_3 = A_2$, $B_4 = A_3$, $B_5 = A_4$,
$m = 5$. If $A_0 < -1$, then $B_0 = -1$, $B_1 = 1$, $B_2 = -A_0 - 2$, $B_3 = 1$, $B_4 = A_1 - 1$,
$B_5 = A_2$, $B_6 = A_3$, $B_7 = A_4$, $m = 7$. [Actually, the last three cases involve eight subcases;
if any of the $B$’s is set to zero, the values should be “collapsed together” by using
the rule of exercise 9(c). For example, if $A_0 = -1$, $A_1 = A_3 = 1$, we actually have
$B_0 = -(A_2 + 2)$, $B_1 = A_4 + 1$, $m = 1$. Double collapsing occurs when $A_0 = -2$,
$A_1 = 1$.]

11. Let $q_n = Q_n(A_1, \ldots, A_n)$, $q'_n = Q_n(B_1, \ldots, B_n)$, $p_n = Q_{n+1}(A_0, \ldots, A_n)$, $p'_n =
Q_{n+1}(B_0, \ldots, B_n)$. By (5) and (11) we have $X = (p_m + p_{m-1}X_m)/(q_m + q_{m-1}X_m)$,
$Y = (p'_n + p'_{n-1}Y_n)/(q'_n + q'_{n-1}Y_n)$; therefore if $X_m = Y_n$, the stated relation between
$X$ and $Y$ holds by (8). Conversely, if $X = (qY + r)/(sY + t)$, $|qt - rs| = 1$, we may
assume that $s \ge 0$, and we can show that the partial quotients of $X$ and $Y$ eventually
agree, by induction on $s$. The result is clear when $s = 0$, by exercise 9(d). If $s > 0$,
let $q = as + s'$, where $0 \le s' < s$. Then $X = a + 1/((sY + t)/(s'Y + r - at))$; since
$s(r - at) - ts' = sr - tq$, and $s' < s$, we know by induction and exercise 10 that the
partial quotients of $X$ and $Y$ eventually agree. [Note: The fact that $m$ is always odd
in exercise 10 shows, by a close inspection of this proof, that $X_m = Y_n$ if and only if
$X = (qY + r)/(sY + t)$, where $qt - rs = (-1)^{m-n}$.]

12. (a) Since $V_nV_{n+1} = D - U_n^2$, we know that $D - U_{n+1}^2$ is a multiple of $V_{n+1}$;
hence by induction $X_n = (\sqrt{D} - U_n)/V_n$, where $U_n$ and $V_n$ are integers. [Note that
the identity $V_{n+1} = A_n(U_{n-1} - U_n) + V_{n-1}$ makes it unnecessary to divide when
$V_{n+1}$ is being determined.]

(b) Let $Y = (-\sqrt{D} - U)/V$, $Y_n = (-\sqrt{D} - U_n)/V_n$. The stated identity
obviously holds by replacing $\sqrt{D}$ by $-\sqrt{D}$ in the proof of (a). We have

$$Y = (p_n/Y_n + p_{n-1})/(q_n/Y_n + q_{n-1}),$$

where $p_n$ and $q_n$ are defined in part (c) of this exercise; hence

$$Y_n = (-q_n/q_{n-1})(Y - p_n/q_n)/(Y - p_{n-1}/q_{n-1}).$$

But by (12), $p_{n-1}/q_{n-1}$ and $p_n/q_n$ are extremely close to $X$; since $X \ne Y$, $Y - p_n/q_n$
and $Y - p_{n-1}/q_{n-1}$ will have the same sign as $Y - X$ for all large $n$. This proves that
$Y_n < 0$ for all large $n$; hence $0 < X_n < X_n - Y_n = 2\sqrt{D}/V_n$; $V_n$ must be positive.
Also $U_n < \sqrt{D}$, since $X_n > 0$. Hence $V_n < 2\sqrt{D}$, since $V_n \le A_nV_n < \sqrt{D} + U_{n-1}$.

Finally, we want to show that $U_n > 0$. Since $X_n < 1$, we have $U_n > \sqrt{D} - V_n$, so
we need only consider the case $V_n > \sqrt{D}$; then $U_n = A_nV_n - U_{n-1} \ge V_n - U_{n-1} >
\sqrt{D} - U_{n-1}$, and this is positive as we have already observed.

Note: In the repeating cycle, $\sqrt{D} + U_n = A_nV_n + (\sqrt{D} - U_{n-1}) > V_n$; hence
$\lfloor (\sqrt{D} + U_{n+1})/V_{n+1} \rfloor = \lfloor A_{n+1} + V_n/(\sqrt{D} + U_n) \rfloor = A_{n+1} = \lfloor (\sqrt{D} + U_n)/V_{n+1} \rfloor$.
In other words $A_{n+1}$ is determined by $U_{n+1}$ and $V_{n+1}$; we can determine $(U_n, V_n)$
from its successor $(U_{n+1}, V_{n+1})$ in the period. In fact, when $0 < V_n < \sqrt{D} + U_n$
and $0 < U_n < \sqrt{D}$, the arguments above prove that $0 < V_{n+1} < \sqrt{D} + U_{n+1}$ and
$0 < U_{n+1} < \sqrt{D}$; moreover, if the pair $(U_{n+1}, V_{n+1})$ follows $(U', V')$ with $0 < V' <
\sqrt{D} + U'$ and $0 < U' < \sqrt{D}$, then $U' = U_n$ and $V' = V_n$. Hence $(U_n, V_n)$ is part of
the cycle if and only if $0 < V_n < \sqrt{D} + U_n$ and $0 < U_n < \sqrt{D}$.

(c) $$-\frac{V_{n+1}}{V_n} = X_nY_n = \frac{(q_nX - p_n)(q_nY - p_n)}{(q_{n-1}X - p_{n-1})(q_{n-1}Y - p_{n-1})}.$$

There is also a companion identity, namely

$$Vp_np_{n-1} + U(p_nq_{n-1} + p_{n-1}q_n) + ((U^2 - D)/V)\,q_nq_{n-1} = (-1)^nU_n.$$

(d) If $X_n = X_m$ for some $n \ne m$, then $X$ is an irrational number that satisfies the
quadratic equation $(q_nX - p_n)/(q_{n-1}X - p_{n-1}) = (q_mX - p_mX^0)/(q_{m-1}X - p_{m-1})$.
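The integer recurrences used in answer 12 are the familiar ones for expanding a square root. The sketch below specializes them to $\sqrt{D}$ itself, with starting values U = 0, V = 1 chosen here as an assumption (the exercise's own starting point may be indexed differently), and records the pairs (U, V) so that the cycle conditions 0 < U < √D and 0 < V < √D + U can be inspected.

```python
from math import isqrt

def sqrt_cf(D: int, count: int):
    """First `count` partial quotients of sqrt(D), plus the (U, V) pairs.

    Uses U <- A*V - U, V <- (D - U*U)/V, A <- floor((sqrt(D) + U)/V),
    all in exact integer arithmetic.
    """
    r = isqrt(D)
    if r * r == D:
        raise ValueError("D must not be a perfect square")
    U, V, A = 0, 1, r
    quotients, states = [A], []
    for _ in range(count - 1):
        U = A * V - U
        V = (D - U * U) // V          # exact: V always divides D - U^2
        A = (r + U) // V              # floor((sqrt(D) + U)/V), since V >= 1
        quotients.append(A)
        states.append((U, V))
    return quotients, states
```

For D = 7 the quotients come out as 2, 1, 1, 1, 4, 1, 1, 1, 4, and every recorded pair satisfies U ≤ ⌊√7⌋ and V ≤ ⌊√7⌋ + U.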

14. As in exercise 9, we need only verify the stated identities when $c$ is the last
partial quotient, and this verification is trivial. Now Hurwitz’s rule gives $2/e =
/\!/1, 2, 1, 2, 0, 1, 1, 1, 1, 1, 0, 2, 3, 2, 0, 1, 1, 3, 1, 1, 0, 2, 5, \ldots/\!/$. Taking the reciprocal, collapsing
out the zeros as in exercise 9, and taking note of the pattern that appears, we find
(cf. exercise 16) that $e/2 = 1 + /\!/2, \overline{2m + 1, 3, 1, 2m + 1, 1, 3}/\!/$, $m \ge 0$. [Schriften der
phys.-ökon. Gesellschaft zu Königsberg 32 (1891), 59-62.]
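The pattern for e/2 can be observed numerically. The sketch below expands a rational approximation of e (a partial sum of the series for e, accurate to far more places than the first fifteen quotients need); the helper name `cf_terms` is hypothetical.

```python
from fractions import Fraction
from math import factorial

def cf_terms(x: Fraction, count: int):
    """First `count` partial quotients of the regular continued fraction of x."""
    terms = []
    for _ in range(count):
        a = x.numerator // x.denominator   # floor
        terms.append(a)
        x -= a
        if x == 0:
            break
        x = 1 / x
    return terms

# Truncation error of the series is below 2/40!, negligible here.
e_approx = sum(Fraction(1, factorial(k)) for k in range(40))
print(cf_terms(e_approx / 2, 15))
```

The leading quotients 1, 2 are followed by the blocks 1, 3, 1, 1, 1, 3 and 3, 3, 1, 3, 1, 3, which are the cases m = 0 and m = 1 of the repeating pattern.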

15. (This procedure maintains four integers $(A, B, C, D)$ with the invariant meaning
that “our remaining job is to output the continued fraction for $(Ay + B)/(Cy + D)$,
where $y$ is the input yet to come.”) Initially set $j \leftarrow k \leftarrow 0$, $(A, B, C, D) \leftarrow (a, b, c, d)$;
then input $x_j$ and set $(A, B, C, D) \leftarrow (Ax_j + B, A, Cx_j + D, C)$, $j \leftarrow j + 1$, one or
more times until $C + D$ has the same sign as $C$. (When $j \ge 1$ and the input has
not terminated, we know that $1 < y < \infty$; and when $C + D$ has the same sign
as $C$ we know therefore that $(Ay + B)/(Cy + D)$ lies between $(A + B)/(C + D)$ and
$A/C$.) Now comes the general step: If no integer lies strictly between $(A + B)/(C + D)$
and $A/C$, output $X_k \leftarrow \lfloor A/C \rfloor$, and set $(A, B, C, D) \leftarrow (C, D, A - X_kC, B - X_kD)$,
$k \leftarrow k + 1$; otherwise input $x_j$ and set $(A, B, C, D) \leftarrow (Ax_j + B, A, Cx_j + D, C)$,
$j \leftarrow j + 1$. The general step is repeated ad infinitum. However, if at any time the
final $x_j$ is input, the algorithm immediately switches gears: It outputs the continued
fraction for $(Ax_j + B)/(Cx_j + D)$, using Euclid’s algorithm, and terminates.
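The procedure of answer 15 can be sketched for finite inputs. Several choices below are assumptions rather than part of the answer: the input is a finite list whose last partial quotient is at least 2 (when there is more than one), ad - bc is nonzero, the final denominator is nonzero, and the quotient that is output is the floor of the smaller endpoint, which guards the case where A/C is itself an integer. The function names are hypothetical.

```python
from fractions import Fraction

def euclid_cf(p: int, q: int):
    """Partial quotients of p/q (q != 0) by Euclid's algorithm."""
    if q < 0:
        p, q = -p, -q
    out = []
    while q:
        a = p // q
        out.append(a)
        p, q = q, p - a * q
    return out

def homographic_cf(a, b, c, d, xs):
    """Continued fraction of (a*X + b)/(c*X + d), where xs are the partial
    quotients of X, consuming the input only as far as needed."""
    A, B, C, D = a, b, c, d
    out, j = [], 0
    while True:
        ready = j >= 1 and C != 0 and C + D != 0 and (C > 0) == (C + D > 0)
        if ready:
            lo, hi = sorted((Fraction(A, C), Fraction(A + B, C + D)))
            floor_lo = lo.numerator // lo.denominator
            if floor_lo + 1 >= hi:          # no integer strictly between
                out.append(floor_lo)
                A, B, C, D = C, D, A - floor_lo * C, B - floor_lo * D
                continue
        x = xs[j]
        if j == len(xs) - 1:                # final input: finish with Euclid
            return out + euclid_cf(A * x + B, C * x + D)
        A, B, C, D = A * x + B, A, C * x + D, C
        j += 1
```

With X = //1, 2, 3, 4// = 43/30, the call `homographic_cf(3, 1, 2, 5, [1, 2, 3, 4])` yields the quotients of 159/236.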

The following tableau shows the working for the requested example, where the
matrix $\begin{pmatrix} B & A \\ D & C \end{pmatrix}$ begins at the upper left corner, then shifts right one on input, down one
on output:



M. Mendès France has shown that the number of quotients output per quotient input
is asymptotically bounded between $1/r$ and $r$, where $r = 2\lfloor K(|ad - bc|)/2 \rfloor + 1$ and $K$
is the function defined in exercise 38; this bound is best possible. [Topics in Number
Theory, ed. by P. Turán, Colloquia Math. Soc. János Bolyai 13 (1976), 183-194.]

Gosper has also shown that the above algorithm can be generalized to compute
the continued fraction for $(axy + bx + cy + d)/(Axy + Bx + Cy + D)$ from those of $x$
and $y$ (in particular, to compute sums and products). [MIT AI Laboratory Memo 239
(Feb. 29, 1972), Hack 101.]

16. It is not difficult to prove by induction that $f_n(z) = z/(2n + 1) + O(z^3)$ is an odd
function with a convergent power series in a neighborhood of the origin, and that it
satisfies the given differential equation. Hence

$$f_0(z) = /\!/z^{-1} + f_1(z)/\!/ = \cdots = /\!/z^{-1}, 3z^{-1}, \ldots, (2n + 1)z^{-1} + f_{n+1}(z)/\!/.$$

It remains to prove that $\lim_{n \to \infty} /\!/z^{-1}, 3z^{-1}, \ldots, (2n + 1)z^{-1}/\!/ = f_0(z)$. [Actually
Euler, age 24, obtained continued fraction expansions for the considerably more general
differential equation $f'_n(z) = az^m + bf_n(z)z^{m-1} + cf_n(z)^2$; but he did not bother to
prove convergence, since formal manipulation and intuition were good enough in the
eighteenth century.]

There are several ways to prove the desired limiting equation. First, letting
$f_n(z) = \sum_k a_{nk}z^k$, we can argue from the equation

$$(2n + 1)a_{n1} + (2n + 3)a_{n3}z^2 + (2n + 5)a_{n5}z^4 + \cdots = 1 - (a_{n1}z + a_{n3}z^3 + a_{n5}z^5 + \cdots)^2$$

that $(-1)^ka_{n(2k+1)}$ is a sum of terms of the form $c_k/(2n + 1)^{k+1}(2n + b_{k1}) \ldots (2n + b_{kk})$,
where the $c_k$ and $b_{km}$ are positive integers independent of $n$. For example, we have
$-a_{n7} = 4/(2n + 1)^4(2n + 3)(2n + 5)(2n + 7) + 1/(2n + 1)^4(2n + 3)^2(2n + 7)$. Thus
$|a_{(n+1)k}| \le |a_{nk}|$, and $|f_n(z)| \le \tan|z|$ for $|z| < \pi/2$. This uniform bound on $f_n(z)$
makes the convergence proof very simple. Careful study of this argument reveals that
the power series for $f_n(z)$ actually converges for $|z| < \pi\sqrt{2n + 1}/2$; this is interesting,
since it shows that the singularities of $f_n(z)$ get farther and farther away from the
origin as $n$ grows, so the continued fraction actually represents $\tanh z$ throughout the
complex plane.

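Another proof gives further information of a different kind: If we let

$$A_n(z) = n! \sum_{0 \le k \le n} \binom{2n-k}{n} \frac{z^k}{k!} = \sum_{k \ge 0} \frac{(n + k)!\,z^{n-k}}{k!\,(n - k)!},$$

then

$$A_{n+1}(z) = \sum_{k \ge 0} \frac{(n + k - 1)!\,\bigl((4n + 2)k + (n + 1 - k)(n - k)\bigr)}{k!\,(n + 1 - k)!}\,z^{n+1-k} = (4n + 2)A_n(z) + z^2A_{n-1}(z).$$

It follows, by induction, that

$$Q_n\Bigl(\frac1z, \frac3z, \ldots, \frac{2n-1}{z}\Bigr) = \frac{A_n(2z) + A_n(-2z)}{2^{n+1}z^n}, \qquad Q_{n-1}\Bigl(\frac3z, \ldots, \frac{2n-1}{z}\Bigr) = \frac{A_n(2z) - A_n(-2z)}{2^{n+1}z^n}.$$

Hence

$$/\!/z^{-1}, 3z^{-1}, \ldots, (2n - 1)z^{-1}/\!/ = \frac{A_n(2z) - A_n(-2z)}{A_n(2z) + A_n(-2z)},$$

and we want to show that this ratio approaches $\tanh z$. By Eqs. 1.2.9-11 and 1.2.6-24,

$$e^zA_n(-z) = n! \sum_{m \ge 0} \Bigl( \sum_{0 \le k \le n} \binom{m}{k} \binom{2n-k}{n} (-1)^k \Bigr) \frac{z^m}{m!} = \sum_{m \ge 0} \binom{2n-m}{n} z^m \frac{n!}{m!}.$$

Hence

$$e^zA_n(-z) - A_n(z) = R_n(z) = (-1)^n z^{2n+1} \sum_{k \ge 0} \frac{(n + k)!\,z^k}{(2n + k + 1)!\,k!}.$$

We now have $(e^{2z} - 1)(A_n(2z) + A_n(-2z)) - (e^{2z} + 1)(A_n(2z) - A_n(-2z)) = 2R_n(2z)$;
hence

$$\tanh z - /\!/z^{-1}, 3z^{-1}, \ldots, (2n - 1)z^{-1}/\!/ = \frac{2R_n(2z)}{(A_n(2z) + A_n(-2z))(e^{2z} + 1)}.$$

Thus we have an exact formula for the difference. When $|2z| \le 1$, the factor $e^{2z} + 1$ is
bounded away from zero, $|R_n(2z)| \le e\,n!/(2n + 1)!$, and

$$\tfrac12 |A_n(2z) + A_n(-2z)| \ge n! \Bigl( \binom{2n}{n} - \binom{2n-2}{n} - \binom{2n-4}{n} - \cdots \Bigr) \ge \frac{(2n)!}{n!} \Bigl( 1 - \frac14 - \frac1{16} - \cdots \Bigr) = \frac23\,\frac{(2n)!}{n!}.$$

Thus convergence is very rapid, even for complex values of $z$.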
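The speed of convergence is easy to observe numerically. The sketch below evaluates the truncated continued fraction //1/z, 3/z, ..., (2n-1)/z// from the bottom up and compares it with a library hyperbolic tangent; the function name `tanh_cf` is hypothetical, and z must be nonzero.

```python
import cmath

def tanh_cf(z, n: int):
    """Evaluate //1/z, 3/z, ..., (2n-1)/z// for nonzero real or complex z."""
    value = 0
    for k in range(n, 0, -1):
        value = 1 / ((2 * k - 1) / z + value)
    return value

for z in (0.5, 0.3 + 0.4j):
    # Ten partial quotients already agree with tanh to near machine precision.
    print(z, abs(tanh_cf(z, 10) - cmath.tanh(z)))
```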

To go from this continued fraction to the continued fraction for $e^z$, we have
$\tanh z = 1 - 2/(e^{2z} + 1)$; hence we get the continued-fraction representation for
$(e^{2z} + 1)/2$ by simple manipulations. Hurwitz’s rule gives the expansion of $e^{2z} + 1$,
from which we may subtract unity. For $n$ odd,

$$e^{-2/n} = /\!/\,\overline{1, 3mn + \lfloor n/2 \rfloor, (12m + 6)n, (3m + 2)n + \lfloor n/2 \rfloor, 1}\,/\!/, \qquad m \ge 0.$$

Another derivation has been given by C. S. Davis, J. London Math. Soc. 20 (1945),
194-198.

17. (b) $/\!/x_1 - 1, 1, x_2 - 2, 1, x_3 - 2, 1, \ldots, 1, x_{2n-1} - 2, 1, x_{2n} - 1/\!/$. [Note: One can
remove negative parameters from continuants by using the identity

$$Q_{m+n+1}(x_1, \ldots, x_n, -x, y_m, \ldots, y_1) = (-1)^{m-1} Q_{m+n+2}(x_1, \ldots, x_{n-1}, x_n - 1, 1, x - 1, -y_m, \ldots, -y_1),$$

from which we obtain

$$Q_{m+n+1}(x_1, \ldots, x_n, -x, y_m, \ldots, y_1) = -Q_{m+n+3}(x_1, \ldots, x_{n-1}, x_n - 1, 1, x - 2, 1, y_m - 1, y_{m-1}, \ldots, y_1)$$

after a second application.]

(c) $1 + /\!/1, 1, 3, 1, 5, 1, \ldots/\!/ = 1 + /\!/\,\overline{2m + 1, 1}\,/\!/$, $m \ge 0$.

19. The sum for $1 \le k \le N$ is $\log_b((1 + x)(N + 1)/(N + 1 + x))$.

20. Let $H = SG$, $g(x) = (1 + x)G'(x)$, $h(x) = (1 + x)H'(x)$. Then (35) implies that
$h(x + 1)/(x + 2) - h(x)/(x + 1) = -(1 + x)^{-2} g(1/(1 + x))/(1 + 1/(1 + x))$.

21. φ(x) = c/(cx + 1)^2 + (2 - c)/((c - 1)x + 1)^2, Uφ(x) = 1/(x + c)^2. When c ≤ 1, the
minimum of φ(x)/Uφ(x) occurs at x = 0 and is 2c^2 ≤ 2. When c ≥ ϕ = (√5 + 1)/2,
the minimum occurs at x = 1 and is ≤ ϕ^2. When c ≈ 1.31266 the values at x = 0 and
x = 1 are nearly equal and the minimum is > 3.2; the bounds (0.29)^n φ ≤ U^n φ ≤
(0.31)^n φ are obtained. Still better bounds come from well-chosen linear combinations
Tg(x) = Σ a_j/(x + c_j).

23. By the interpolation formula of exercise 4.6.4-15 with x_0 = 0, x_1 = x, x_2 = x + ε,
letting ε → 0, we have the general identity R'_n(x) = (R_n(x) - R_n(0))/x + (1/2) x R''_n(θ(x))
for some θ(x) between 0 and x, whenever R_n is a function with continuous second
derivative. Hence in this case R'_n(x) = O(2^{-n}).

24. ∞. [A. Khinchin, in Compos. Math. 1 (1935), 361-382, proved that the sum
A_1 + ... + A_n of the first n partial quotients of a real number X will be asymptotically
n lg n, for almost all X. Exercise 35 shows that the behavior is different for rational X.]

25. Any union of intervals can be written as a union of disjoint intervals, since we have
∪_{k≥1} I_k = ∪_{k≥1} (I_k \ ∪_{1≤j<k} I_j), and this is a disjoint union in which I_k \ ∪_{1≤j<k} I_j
can be expressed as a finite union of disjoint intervals. Therefore we may take I =
∪ I_k, where I_k is an interval of length ε/2^k containing the kth rational number in [0, 1],
using some enumeration of the rationals. In this case μ(I) ≤ ε, but ||I ∩ F_n|| = n for
all n.

26. The continued fractions //A_1, ..., A_t// that appear are precisely those for which
A_1 > 1, A_t > 1, and Q_t(A_1, A_2, ..., A_t) is a divisor of n. Therefore (6) completes the
proof. [Note: If m_1/n = //A_1, ..., A_t// and m_2/n = //A_t, ..., A_1//, where m_1 and
m_2 are relatively prime to n, then m_1 m_2 ≡ ±1 (modulo n); this rule defines the
correspondence. When A_1 = 1 an analogous symmetry is valid, according to (44).]

27. First prove the result for n = p^e, then for n = rs, where r and s are relatively
prime. Alternatively, use the formulas in the next exercise. 

28. (a) The left-hand side is multiplicative (see exercise 1.2.4-31), and it is easily
evaluated when n is a power of a prime. (c) From (a), we have Möbius's inversion
formula: If f(n) = Σ_{d\n} g(d), then g(n) = Σ_{d\n} μ(n/d) f(d).

29. The sum is approximately (((12 ln 2)/π^2) ln N!)/N - Σ_{d≥1} Λ(d)/d^2 + 1.47; here
Σ_{d≥1} Λ(d)/d^2 converges to the constant value -ζ'(2)/ζ(2), and we know that ln N! =
N ln N - N + O(log N) by Stirling's approximation.

30. The modified algorithm affects the calculation if and only if the following division 
step in the unmodified algorithm would have the quotient 1, and in this case it avoids 
the following division step. The probability that a given division step is avoided is 
the probability that A_k = 1 and that this quotient is preceded by an even number of
quotients equal to 1. By the symmetry condition, this is the probability that A_k = 1
and is followed by an even number of quotients equal to 1. The latter happens if and
only if X_{k-1} > ϕ - 1 = 0.618..., where ϕ is the golden ratio: For A_k = 1 and A_{k+1} >
1 iff 2/3 ≤ X_{k-1} < 1; A_k = A_{k+1} = A_{k+2} = 1 and A_{k+3} > 1 iff 5/8 ≤ X_{k-1} < 2/3;
etc. Thus we save approximately F_{k-1}(1) - F_{k-1}(ϕ - 1) ≈ 1 - lg ϕ ≈ 0.306 of the
division steps. The average number of steps is approximately ((12 ln ϕ)/π^2) ln n, when
v = n and u is relatively prime to n. Kronecker [Vorlesungen über Zahlentheorie 1
(Leipzig: Teubner, 1901), 118] observed that this choice of least remainder in absolute
value always gives the shortest possible number of iterations, over all algorithms that
replace u by (±u) mod v at each iteration. For further results see N. G. de Bruijn and
W. M. Zaring, Nieuw Archief voor Wiskunde (3) 1 (1953), 105-112; G. J. Rieger, Math. 
Nachr. 82 (1978), 157-180. 

On many computers, the modified algorithm makes each division step longer; the 
idea of exercise 1, which saves all division steps when the quotient is unity, would be 
preferable in such cases. 
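
The comparison in answer 30 can be tried out numerically. The following is only a minimal sketch, assuming that "least remainder in absolute value" means replacing the remainder r by v - r whenever 2r > v; the function names are illustrative and do not come from the text.

```python
def euclid_steps(u, v):
    """Count division steps of the ordinary Euclidean algorithm."""
    steps = 0
    while v:
        u, v = v, u % v
        steps += 1
    return steps


def least_remainder_steps(u, v):
    """Count division steps when each remainder is least in absolute value."""
    steps = 0
    while v:
        r = u % v
        if 2 * r > v:      # v - r is closer to zero than r
            r = v - r      # only |remainder| matters for the gcd
        u, v = v, r
        steps += 1
    return steps
```

For the consecutive Fibonacci numbers 13 and 8, the ordinary algorithm needs five division steps, while the least-remainder variant needs three.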

31. Let a_0 = 0, a_1 = 1, a_{n+1} = 2a_n + a_{n-1}; then a_n = ((1 + √2)^n - (1 - √2)^n)/(2√2),
and the worst case (in the sense of Theorem F) occurs when u = a_n + a_{n-1}, v = a_n,
n ≥ 2.

This result is due to A. Dupré [J. de Math. 11 (1846), 41-64], who also investigated
more general “look-ahead” procedures suggested by J. Binet. See P. Bachmann,
Niedere Zahlentheorie 1 (Leipzig: Teubner, 1902), 99-118, for a discussion of early 
analyses of Euclid’s algorithm. 

32. (b) Q_{m-1}(x_1, ..., x_{m-1}) Q_{n-1}(x_{m+2}, ..., x_{m+n}) corresponds to Morse code
sequences of length (m + n) in which a dash occupies positions m and (m + 1); the other
term corresponds to the opposite case. (Alternatively, use exercise 2. The more general
identity

    Q_{m+n}(x_1, ..., x_{m+n}) Q_k(x_{m+1}, ..., x_{m+k}) =
        Q_{m+k}(x_1, ..., x_{m+k}) Q_n(x_{m+1}, ..., x_{m+n})
        + (-1)^k Q_{m-1}(x_1, ..., x_{m-1}) Q_{n-k-1}(x_{m+k+2}, ..., x_{m+n})

also appeared in Euler's paper.)
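
A quick numerical check of the general continuant identity is sketched below. It assumes the identity reads Q_{m+n}(x_1..x_{m+n}) Q_k(x_{m+1}..x_{m+k}) = Q_{m+k}(x_1..x_{m+k}) Q_n(x_{m+1}..x_{m+n}) + (-1)^k Q_{m-1}(x_1..x_{m-1}) Q_{n-k-1}(x_{m+k+2}..x_{m+n}) for m ≥ 1 and 0 ≤ k < n; the helper names are hypothetical.

```python
def Q(xs):
    """Continuant Q_n(x_1, ..., x_n), with Q_0() = 1."""
    a, b = 1, 0                      # Q_0 and the conventional Q_{-1} = 0
    for x in xs:
        a, b = x * a + b, a          # Q_i = x_i Q_{i-1} + Q_{i-2}
    return a


def euler_identity_holds(xs, m, k):
    """Check the identity for xs = [x_1, ..., x_{m+n}], m >= 1, 0 <= k < n."""
    lhs = Q(xs) * Q(xs[m:m + k])
    rhs = (Q(xs[:m + k]) * Q(xs[m:])
           + (-1) ** k * Q(xs[:m - 1]) * Q(xs[m + k + 1:]))
    return lhs == rhs
```

Setting k = 0 gives back the two-term splitting rule that part (b) interprets with Morse code sequences.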

33. (a) The new representations are x = m/d, y = (n — m)/d, x' = y' = d = 

gcd (m,n — m), for \n < m < n. (b) The relation {n/x') — y < x < n/x' defines x. 
(c) Count the x' satisfying (b). (d) A pair of integers x > y > 0 with gcd(x, y) — 1 
can be uniquely written in the form x = Q m {x 1 ,..., x m ), y = Qm - 1 ( 2 : 1 , ■ • •, x m — 1 ), 
where Xi > 2 and m > 1; here y/x = /x m ,...,xi/. (e) It suffices to show that 

Ei</c<n/ 2 T ( /c ’ n ) =:2 L n / 2 J + M^)- Fori < k < n/2 the continued fractions k/n = 
/x 1 ,..., Xml run through all sequences (xi,..., x m ) such that m > 1, Xi > 2, x m > 2, 
Qm(xi,..., x m ) \ n; and T(k,n) = 2 + (m — 1). 
34. (a) Dividing x and y by gcd(x, y) yields g(n) = Σ_{d\n} h(n/d); apply exercise 28(c),
and use the symmetry between primed and unprimed variables. (b) For fixed y and t,
the representations with xd ≥ x' have x' < √(nd); hence there are O(√(nd)/y) such
representations. Now sum for 0 < t ≤ y < √(n/d). (c) If s(y) is the given sum,

then T, d \y 5 ( rf )= y( H *y — H v ) = k (y ■)» sa y; hence s[y) = J 2 d \ y k (y/ d )- Now Kv) — 
y\n2-\+0{l/y). (d) £i< y < n <p{y)/y 2 = Ei< y < n ,d\ y M/V d = E cd < n Mfcd 2 . 
(Similarly, Ei< y <n ff -i(^ 2 = ( e ) Ei< fc < n ti k )/ k2 = 6/tt 2 + 0(1/n) (see 

exercise 4.5.2—10(d)); and Ei<fc<n ^{k)\ogk/k 2 = 0(1). Hence we have hd(n) = 
n((3 In 2)/7r 2 ) ln(n/d)-f O(n) for ~d > 1. Finally h[n) — 2^2 cd ^ n p,(d)h c {n/cd) = 
((6 In 2)/7r 2 )n(ln n — E-EO -\- 0(n<7_i(n) 2 ), where the remaining sums are ^ = 

Ec d\n^ d ) ln ( Cd )/ C(i = 0 and E' = I2cd\n^ d ) lnc / Cd = T,d\n A< < d y d - [ II is Wel1 

known that cr_i(n) = O(loglogn); cf. Hardy and Wright, Theory of Numbers, §22.9.] 

35. See Proc. Nat. Acad. Sci. 72 (1975), 4720-4722.

36. Working the algorithm backwards, we want to choose k_1, ..., k_{n-1} so that
u_i ≡ F_{k_1} ... F_{k_{i-1}} F_{k_i - 1} (modulo gcd(u_{i+1}, ..., u_n)) for 1 ≤ i < n, with u_n =
F_{k_1} ... F_{k_{n-1}} a minimum, where the k's are positive, k_1 ≥ 3, and k_1 + ... + k_{n-1} =
N + n - 1. The solution is k_2 = ... = k_{n-1} = 2, u_n = F_{N-n+3}. [See CACM 13
(1970), 433-436, 447-448.]

37. See Proc. Amer. Math. Soc. 7 (1956), 1014-1021; cf. also exercise 6.1-18. 

38. Let m — \n/(p], so that m/n = + e = /xi, X 2 ,... / where 0 < e < l/n. Let 

k be minimal such that x k > 2; then (0 1_ ~ fc + (—l) /c F fc _ie)/(^ _fc — (— l) k F k e) > 2, 
hence k is even and <p~ 2 — 2 — (j> < <p k F k+2 e — (0 2/c+2 — <p~ 2 )e/\/5. [Ann. Polon. 
Math. 1 (1954), 203-206.] 

39. At least 287 at bats; //2, 1, 95// = 96/287 = .33449477..., and no fraction with
denominator < 287 lies in the interval [.3335, .3345] = [//2, 1, 666//, //2, 1, 94, 1, 1, 3//].

To solve the general question of the fraction in [a, b] with smallest denominator,
where 0 < a < b < 1, note that in terms of regular continued-fraction representations
we have //x_1, x_2, ...// < //y_1, y_2, ...// iff (-1)^j x_j < (-1)^j y_j for the smallest j with
x_j ≠ y_j, where we place “∞” after the last partial quotient of a rational number.
Thus if a = //x_1, x_2, ...// and b = //y_1, y_2, ...//, and if j is minimal with x_j ≠ y_j, the
fractions in [a, b] have the form c = //x_1, ..., x_{j-1}, z_j, ..., z_m// where //z_j, ..., z_m// lies
between //x_j, x_{j+1}, ...// and //y_j, y_{j+1}, ...// inclusive. Let Q_{-1} = 0. The denominator

    Q_{j-1}(x_1, ..., x_{j-1}) Q_{m-j+1}(z_j, ..., z_m) + Q_{j-2}(x_1, ..., x_{j-2}) Q_{m-j}(z_{j+1}, ..., z_m)

of c is minimized when m = j and z_j = (j odd ⇒ y_j + 1 - δ_{y_{j+1} ∞}; x_j + 1 - δ_{x_{j+1} ∞}).
[Another way to derive this method comes from the theory in the following exercise.]
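
The search for the fraction of smallest denominator in [a, b] can be sketched with exact rational arithmetic. This is a hedged illustration rather than the method exactly as stated above: it uses the familiar recursive formulation (strip the common integer part, then recurse on the reciprocals), and the function name is an assumption.

```python
from fractions import Fraction
from math import floor


def simplest_between(a, b):
    """Fraction of smallest denominator in the closed interval [a, b], 0 <= a <= b."""
    fa = floor(a)
    if fa == a:
        return Fraction(fa)          # a itself is an integer
    if fa < floor(b):
        return Fraction(fa + 1)      # an integer lies above a and is at most b
    # a and b share the integer part fa: recurse on the reciprocals
    # of the fractional parts (the order of the endpoints is reversed)
    r = simplest_between(1 / (b - fa), 1 / (a - fa))
    return fa + 1 / r
```

Applied to the interval [.3335, .3345] it walks through the common partial quotients 2, 1 and then picks 95, reproducing 96/287.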

40. One can prove by induction that p_r q_l - p_l q_r = 1 at each node, hence p_l and q_l
are relatively prime. Since p/q < p'/q' implies that p/q < (p + p')/(q + q') < p'/q',
it is also clear that the labels on all left descendants of p/q are less than p/q, while the
labels on all its right descendants are greater. Therefore each rational number occurs
at most once as a label.

It remains to show that each rational does appear. If p/q = //a_1, ..., a_r, 1//, where
each a_i is a positive integer, one can show by induction that the node labeled p/q is
found by going left a_1 times, then right a_2 times, then left a_3 times, etc.
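
The left/right rule can be checked mechanically for fractions between 0 and 1. The sketch below assumes the tree has root 1/1 and that each node is the mediant of its nearest left and right ancestors, starting from the boundary values 0/1 and 1/0; all names are illustrative.

```python
from fractions import Fraction


def tree_path_runs(x):
    """Runs of left/right moves from the root 1/1 down to the node labeled x > 0."""
    lo, hi = (0, 1), (1, 0)          # boundary "fractions" 0/1 and 1/0
    runs = []
    while True:
        mid = (lo[0] + hi[0], lo[1] + hi[1])     # mediant of the two ancestors
        label = Fraction(mid[0], mid[1])
        if x == label:
            return runs
        step = 'L' if x < label else 'R'
        if step == 'L':
            hi = mid
        else:
            lo = mid
        if runs and runs[-1][0] == step:
            runs[-1][1] += 1
        else:
            runs.append([step, 1])


def partial_quotients(x):
    """Regular continued fraction //a_1, a_2, ...// of a fraction 0 < x < 1."""
    a = []
    while x:
        x = 1 / x
        q = x.numerator // x.denominator
        a.append(q)
        x -= q
    return a
```

For example 2/5 = //2, 2// = //2, 1, 1//, and the node 2/5 is reached by going left twice (to 1/2, then 1/3) and right once.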
[The sequence of labels on successive levels of this tree was first studied by M. A.
Stern, J. für die reine und angew. Math. 55 (1858), 193-220, although the relation
to binary trees is not explicit in his paper. Peirce independently communicated this 
construction in a letter dated July 17, 1903, but he never published it; and during the 
next few years he occasionally amused himself by making rather cryptic remarks about 
it without revealing the underlying mechanism. See C. S. Peirce, The New Elements of 
Mathematics 3 (The Hague: Mouton, 1976), 781-784, 826-829; also 1, 207-211; and his 
Collected Papers 4 (1933), 276-280. See also D. H. Lehmer, AMM 36 (1929), 59-67.] 

41. In fact, the regular continued fractions for numbers of the general form 

i (~ir (-ip 

h Ilk l\l\l 3 

have an interesting pattern, based on the continuant identity 

Qm-\-n-\- l(^li • • • > %n — 1 > X n 1 > 1; pm 1> Hm — 1 > ■ ■ ■ > J/l) == 

XnQn —l(Xl, • • • j Xn — 1 }Q * • * ) yi ) 

“b ( 1) Qm+n{Xi, ■ ■ ■ , Xn —1,0, Vm.) J/m—1, • • • , Vl)' 

This identity is most interesting when = x n — 1 , Vm-i = x n — 2 , etc., since 


Qn-\- l(2i,..., Zlc , 0, Zk-\- 1, . . . , Zn'j — Qn — 1 (^l j • ■ • ? %k — 1, ~ f - Zk-\- 1, %k-\- 2, • • • , ^n)* 

In particular we find that if p_n/q_n = Q_{n-1}(x_2, ..., x_n)/Q_n(x_1, ..., x_n) = //x_1, ..., x_n//,
then p_n/q_n + (-1)^n/(q_n^2 r) = //x_1, ..., x_n, r - 1, 1, x_n - 1, x_{n-1}, ..., x_1//. By changing
//x_1, ..., x_n// to //x_1, ..., x_{n-1}, x_n - 1, 1//, we can control the sign (-1)^n as desired.

For example, the partial sums of the first series have the following continued 
fractions of even length: /1,1/; /l, 1,1,1,0,1/ = /1,1,1, 2/; /l, 1,1, 2,1,1,1,1,1,1/; 
/l, 1,1, 2,1,1,1,1,1,1,1,1, 0,1,1,1,1,1, 2,1,1,1/ = /l, 1,1,2,1,1,1,1,1,1,1,2,1,1,1, 
1,2,1,1,1/; and from this point on the sequence settles down and obeys a simple 
reflecting pattern. We find that the nth partial quotient a_n can be computed rapidly
as follows, if n - 1 = 20q + r where 0 ≤ r < 20:

    a_n = 1,                    if r = 0, 2, 4, 5, 6, 7, 9, 10, 12, 13, 14, 15, 17 or 19;
          2,                    if r = 3 or 16;
          1 + (q + r) mod 2,    if r = 8 or 11;
          2 - d_q,              if r = 1;
          1 + d_{q+1},          if r = 18.

Here d_n is the “dragon sequence” defined by the rules d_0 = 1, d_{2n} = d_n, d_{4n+1} = 0,
d_{4n+3} = 1; the dragon curve discussed in exercise 4.1-18 turns right at its nth step iff
d_n = 1.
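
The rule for a_n is easy to evaluate directly. The sketch below takes the r = 1 and r = 18 cases to be 2 - d_q and 1 + d_{q+1}, a reading that reproduces the twenty partial quotients listed above; the function names are not from the text.

```python
def dragon(n):
    """Dragon sequence: d_0 = 1, d_{2n} = d_n, d_{4n+1} = 0, d_{4n+3} = 1."""
    if n == 0:
        return 1
    while n % 2 == 0:        # d_{2n} = d_n: strip factors of 2
        n //= 2
    return 0 if n % 4 == 1 else 1


def partial_quotient(n):
    """a_n for n >= 1, where n - 1 = 20q + r with 0 <= r < 20."""
    q, r = divmod(n - 1, 20)
    if r in (3, 16):
        return 2
    if r in (8, 11):
        return 1 + (q + r) % 2
    if r == 1:
        return 2 - dragon(q)
    if r == 18:
        return 1 + dragon(q + 1)
    return 1
```

Evaluating partial_quotient(n) for n = 1, ..., 20 gives 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, matching the last continued fraction displayed above.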

Liouville’s numbers with l > 3 are equal to Jl — 1, l -J-1, l 2 — 1, 1, l, l — 1, l 12 — 1, 
1, l — 2,1, 1 , l 2 — 1, l -f* 1, l — 1, l 72 — 1, • • • /• The nth partial quotient a n depends 
on the dragon sequence on mmod4 as follows: If nmod4 — 1 it is l — 2 d n — i + 
([n/2j mod 4) and if nmod4 = 2 it is l -j- 2 — d n + 2 — (L n /2J mod 4); if nmod4 = 0 it 
is 1 or l k '( k —P — depending on whether or not d n = 0 or 1, where k is the largest 
power of 2 dividing n; and if nmod4 = 3 it is l k '( k ~P — 1 or 1, depending on whether 
d n +i = 0 or 1, where k is the largest power of 2 dividing n -|- 1. When l = 2 the same 
rules apply, except that 0’s must be removed, so there is a more complicated pattern 
depending on n mod 24. 

[Cf. J. Number Theory 11 (1979), 209-217.] 






42. Suppose that ||qX|| = |qX - p|. We can always find integers u and v such that
q = u q_{n-1} + v q_n and p = u p_{n-1} + v p_n, where p_n = Q_{n-1}(A_2, ..., A_n), since
q_n p_{n-1} - q_{n-1} p_n = ±1. We must have uv < 0, hence u(q_{n-1} X - p_{n-1}) has the
same sign as v(q_n X - p_n), and |qX - p| = |u| |q_{n-1} X - p_{n-1}| + |v| |q_n X - p_n|. This
completes the proof, since u ≠ 0. See Theorem 6.4S for a generalization.

43. If x is representable, so is the father of x in the Stern-Peirce tree of exercise 40;
thus the representable numbers form a subtree of that binary tree. Let (u/u') and
(v/v') be adjacent representable numbers. Then one is an ancestor of the other; say
(u/u') is an ancestor of (v/v'), since the other case is similar. Then (u/u') is the nearest
left ancestor of (v/v'), so all numbers between u/u' and v/v' are left descendants of
(v/v') and the mediant ((u + v)/(u' + v')) is its left son. According to the relation
between regular continued fractions and the binary tree, the mediant and all of its left
descendants will have (u/u') as their last representable p_i/q_i, while all of the mediant's
right descendants will have (v/v') as one of the p_i/q_i. (The numbers p_i/q_i label the
fathers of the ‘turning-point’ nodes on the path to x.)

44. A counterexample for M = N = 100 is {u/u') = {v/v') = §g. However, the 

identity is almost always true, because of (12); it fails only when u/u' + v/v' is very
nearly equal to a fraction that is simpler than (u/u').

45. See M. S. Waterman, BIT 17 (1977), 465-478. 


SECTION 4.5.4 

1. If d_k isn't prime, its prime factors are cast out before d_k is tried.

2. No; the algorithm would fail if p_{t-1} = p_t, giving “1” as a spurious prime factor.

3. Let P be the product of the first 168 primes. [Note: Although P = 19590...5910
is a 416-digit number, such a gcd can be computed in much less time than it would 
take to do 168 divisions, if we just want to test whether or not n is prime.] 
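
The gcd test of answer 3 might look as follows in a language with built-in big integers. This is a sketch only; the helper names are assumptions, and the test is meant for n > 1000 (a small prime n would of course share a factor with P).

```python
from math import gcd, prod


def primes_below(limit):
    """All primes less than limit, by a simple sieve."""
    sieve = bytearray([1]) * limit
    sieve[0:2] = b"\x00\x00"
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = bytearray(len(range(i * i, limit, i)))
    return [i for i in range(limit) if sieve[i]]


SMALL_PRIMES = primes_below(1000)    # the first 168 primes
P = prod(SMALL_PRIMES)               # a 416-digit number


def has_small_prime_factor(n):
    """True if n > 1000 is divisible by some prime below 1000."""
    return gcd(n, P) > 1
```

One gcd with P replaces the 168 trial divisions when only a yes/no answer is wanted.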

4. In the notation of exercise 3.1-11,

    Σ_{μ≥0, λ≥1} 2^{⌈lg max(μ+1, λ)⌉} P(μ, λ) = (1/m) Σ_{l≥1} f(l) Π_{1≤k<l} (1 - k/m),

where f(l) = Σ_{1≤λ≤l} 2^{⌈lg max(l+1-λ, λ)⌉}. If l = 2^{k+θ}, where 0 < θ ≤ 1, we have f(l) =
l^2 (3·2^{-θ} - 2·2^{-2θ}), where the function 3·2^{-θ} - 2·2^{-2θ} reaches a maximum of 9/8
at θ = lg(4/3) and has a minimum of 1 at θ = 0 and 1. Therefore the average value
of 2^{⌈lg max(μ+1, λ)⌉} lies between 1.0 and 1.125 times the average value of μ + λ, and the
result follows.

[Algorithm B is a refinement of Pollard's original algorithm, which was based on
exercise 3.1-6(b) instead of the (yet undiscovered) result in exercise 3.1-7. He showed
that the least n such that X_{2n} = X_n has average value ~ (π^2/12) Q(m); this constant
π^2/12 is explained by Eq. 4.5.3-21. Hence the average value of 3n in his original
method is ~ (π/2)^{5/2} √m = 3.092 √m. Richard Brent has observed that, as m → ∞,
the density Π_{1≤k<l} (1 - k/m) = exp(-l(l - 1)/2m + O(l^3/m^2)) approaches a normal
distribution, and we may assume that θ is uniformly distributed. Then 3·2^{-θ} - 2·2^{-2θ}
takes the average value 3/(4 ln 2), and the average number of iterations needed by
Algorithm B comes to ~ (3/(4 ln 2) + 1/2) √(πm/2) = 1.983 √m. A similar analysis of
the more general method in the answer to exercise 3.1-7 gives ~ 1.926 √m, when p ≈
2.4771 is chosen “optimally” as the root of (p^2 - 1) ln p = p^2 - p + 1.]
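
The closed form for f(l) in answer 4 can be confirmed by brute force. The sketch assumes f(l) is the sum over 1 ≤ λ ≤ l of 2^⌈lg max(l+1-λ, λ)⌉ (that is, μ + λ = l), which is the reading under which f(l) = l^2 (3·2^{-θ} - 2·2^{-2θ}) holds exactly; with 2^k = l·2^{-θ} the right-hand side becomes the integer 3·l·2^k - 2·4^k.

```python
def f(l):
    """Sum over 1 <= lam <= l of 2**ceil(lg max(l + 1 - lam, lam))."""
    total = 0
    for lam in range(1, l + 1):
        big = max(l + 1 - lam, lam)
        total += 1 << (big - 1).bit_length()     # 2**ceil(lg big) for big >= 1
    return total


def closed_form(l):
    """l^2 (3*2^-theta - 2*2^-2theta) in integers, for l = 2^(k+theta), 0 < theta <= 1."""
    k = (l - 1).bit_length() - 1                 # 2^k < l <= 2^(k+1); needs l >= 2
    return 3 * l * 2 ** k - 2 * 4 ** k
```

The ratio f(l)/l^2 therefore stays between 1 and 9/8, which is where the factor 1.125 in the answer comes from.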

5. x mod 3 = 0; x mod 5 = 0, 1, 4; x mod 7 = 0, 1, 6; x mod 8 = 1, 3, 5, 7; x ≥ 103.
The first try is x = 105; and (105)^2 - 10541 = 484 = 22^2. This would also have been
found by Algorithm C in a relatively short time. Thus 10541 = 83 · 127.
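
The sieve conditions of answer 5 can be generated and applied automatically. The following is a minimal sketch of sieved difference-of-squares factoring for odd N, assuming the moduli 3, 5, 7, 8 used above; it is not a transcription of the book's sieve algorithm, and the function names are hypothetical.

```python
from math import isqrt


def sieve_residues(N, m):
    """Residues x mod m for which x*x - N can be a square modulo m."""
    squares = {(y * y) % m for y in range(m)}
    return {x for x in range(m) if (x * x - N) % m in squares}


def fermat_sieve_factor(N, moduli=(3, 5, 7, 8)):
    """Find x, y with x*x - N == y*y and return (x - y, x + y)."""
    allowed = {m: sieve_residues(N, m) for m in moduli}
    x = isqrt(N)
    if x * x < N:
        x += 1                                   # smallest x with x*x >= N
    while True:
        if all(x % m in allowed[m] for m in moduli):
            y = isqrt(x * x - N)
            if y * y == x * x - N:
                return x - y, x + y
        x += 1
```

For N = 10541 the candidates 103 and 104 are rejected by the sieve modulo 3, and x = 105 succeeds at once.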

6. Let us count the number of solutions (x, y) of the congruence N ≡ (x - y)(x + y)
(modulo p), where 0 ≤ x, y < p. Since N ≢ 0 and p is prime, x + y ≢ 0. For each
v ≢ 0 there is a unique u (modulo p) such that N ≡ uv. The congruences x - y ≡ u,
x + y ≡ v now uniquely determine x mod p and y mod p, since p is odd. Thus the stated
congruence has exactly p - 1 solutions (x, y). If (x, y) is a solution, so is (x, p - y) if
y ≠ 0, since (p - y)^2 ≡ y^2; and if (x, y_1) and (x, y_2) are solutions with y_1 ≠ y_2, we
have y_1^2 ≡ y_2^2; hence y_1 = p - y_2. Thus the number of different x values among the
solutions (x, y) is (p - 1)/2 if N ≡ x^2 has no solutions, or (p + 1)/2 if N ≡ x^2 has
solutions.

7. One procedure is to keep two indices for each modulus, one for the current word 
position and one for the current bit position; loading two words of the table and doing 
an indexed shift command will bring the table entries into proper alignment. (Many 
computers have special facilities for such bit manipulation.) 

8. (We may assume that N = 2M is even.) The following algorithm uses an auxiliary
table X[1], X[2], ..., X[M], where X[k] represents the primality of 2k + 1.

S1. Set X[k] ← 1 for 1 ≤ k ≤ M. Also set j ← 1, k ← 1, p ← 3, q ← 4. (During
this algorithm p = 2j + 1, q = 2j + 2j^2; the integer variables j, k, p, q may
readily be manipulated in index registers.)

S2. If X[j] = 0, go to S4. Otherwise output p, which is prime, and set k ← q.

S3. If k ≤ M, then set X[k] ← 0, k ← k + p, and repeat this step.

S4. Set j ← j + 1, p ← p + 2, q ← q + 2p - 2. If j ≤ M, return to S2. ▮

A major part of this calculation could be made noticeably faster if q (instead of j) were
tested against M in step S4, and if a new loop were appended that outputs 2j + 1 for
all remaining X[j] that equal 1, suppressing the manipulation of p and q.

Improvements in the efficiency of sieve methods for generating primes are discussed 
in exercise 5.2.3-15 and in Section 7.1. 
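
Steps S1-S4 translate almost line for line into a short program. The sketch below assumes the comparisons in S1, S3 and S4 are "≤ M" and keeps the invariant p = 2j + 1, q = 2j + 2j^2 (so that q is the table index of p^2); the function name is illustrative.

```python
def odd_primes_up_to(N):
    """Odd primes p <= N + 1 for even N = 2M, following steps S1-S4."""
    M = N // 2
    X = [1] * (M + 1)          # S1: X[k] describes 2k + 1; X[0] is unused
    primes = []
    j, p, q = 1, 3, 4          # invariant: p = 2j + 1, q = 2j + 2j^2
    while j <= M:
        if X[j]:               # S2: p is prime
            primes.append(p)
            k = q
            while k <= M:      # S3: cross off p^2, p^2 + 2p, p^2 + 4p, ...
                X[k] = 0
                k += p
        j, p = j + 1, p + 2    # S4
        q += 2 * p - 2         # uses the already updated p, as in the text
    return primes
```

Only odd numbers are represented, so the table needs M entries instead of N.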

9. If p^2 is a divisor of n for some prime p, then p is a divisor of λ(n), but not of n - 1.
If n = p_1 p_2, where p_1 < p_2 are primes, then p_2 - 1 is a divisor of λ(n) and therefore
p_1 p_2 - 1 ≡ 0 (modulo p_2 - 1). Since p_2 ≡ 1, this means p_1 - 1 is a multiple of p_2 - 1,
contradicting the assumption p_1 < p_2. [Values of n for which λ(n) properly divides
n - 1 are called “Carmichael numbers.” For example, here are some small Carmichael
numbers with up to six prime factors: 3·11·17, 5·13·17, 7·11·13·41, 5·7·17·19·73,
5·7·17·73·89·107.]
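
The five examples can be verified with a few lines of code. The sketch assumes that, for a product of distinct odd primes, λ(n) is the least common multiple of the numbers p - 1; the names below are hypothetical.

```python
from functools import reduce
from math import gcd

EXAMPLES = [(3, 11, 17), (5, 13, 17), (7, 11, 13, 41),
            (5, 7, 17, 19, 73), (5, 7, 17, 73, 89, 107)]


def lambda_squarefree(primes):
    """lambda(n) for n a product of distinct odd primes: lcm of all p - 1."""
    return reduce(lambda a, b: a * b // gcd(a, b), (p - 1 for p in primes))


def lambda_divides_n_minus_1(primes):
    """True if lambda(n) divides n - 1, where n is the product of the primes."""
    n = 1
    for p in primes:
        n *= p
    return (n - 1) % lambda_squarefree(primes) == 0
```

The same function returns False for every product of two distinct odd primes, in line with the argument of the answer.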

10. Let k_p be the order of x_p modulo n, and let λ be the least common multiple of all
the k_p's. Then λ is a divisor of n - 1 but not of any (n - 1)/p, so λ = n - 1. Since
x_p^{φ(n)} mod n = 1, φ(n) is a multiple of k_p for all p, so φ(n) ≥ λ. But φ(n) < n - 1
when n is not prime. (Another way to carry out the proof is to construct an element
x of order n - 1 from the x_p's, by the method of exercise 3.2.1.2-15.)

11.    U     V     A      P    S     T    Output
    1984     1     0    992    0     -
    1981  1981     1    992    1  1981
    1983     4   495    993    0     1    993^2 ≡ +2^2
    1983   991     2  98109    1   991
    1981     4   495      2    0     1    2^2 ≡ +2^2
    1984  1981     1  99099    1  1981
    1984     1  1984  99101    0     1    99101^2 ≡ +2^0


The factorization 199 • 991 is evident from the first or last outputs. The shortness 
of the cycle, and the appearance of the well-known number 1984, are probably just 
coincidences. 

12. The following algorithm makes use of an auxiliary (m + 1) × (m + 1) matrix of
single-precision integers E_{jk}, 0 ≤ j, k ≤ m; a single-precision vector (b_0, b_1, ..., b_m);
and a multiple-precision vector (x_0, x_1, ..., x_m) with entries in the range 0 ≤ x_k < N.

F1. [Initialize.] Set b_i ← -1 for 0 ≤ i ≤ m; then set j ← 0.

F2. [Next solution.] Get the next output (x, e_0, e_1, ..., e_m) produced by Algorithm E.
(It is convenient to regard Algorithms E and F as coroutines.) Set k ← 0.

F3. [Search for odd.] If k > m go to step F5. Otherwise if e_k is even, set k ← k + 1
and repeat this step.

F4. [Linear dependence?] If b_k ≥ 0, then set i ← b_k, x ← (x_i x) mod N, e_r ← e_r + E_{ir}
for 0 ≤ r ≤ m; set k ← k + 1 and return to F3. Otherwise set b_k ← j, x_j ← x,
E_{jr} ← e_r for 0 ≤ r ≤ m; set j ← j + 1 and return to F2. (In the latter case we
have a new linearly independent solution, modulo 2, whose first odd component is
e_k.)

F5. [Try to factor.] (Now e_0, e_1, ..., e_m are even.) Set

        y ← ((-1)^{e_0/2} p_1^{e_1/2} ... p_m^{e_m/2}) mod N.

If x = y or if x + y = N, return to F2. Otherwise compute gcd(x - y, N), which
is a proper factor of N, and terminate the algorithm. ▮

It can be shown that this algorithm finds a factor, whenever it is possible to deduce
one from the given outputs of Algorithm E. [Proof: Let the outputs of Algorithm E
be (X_i, E_{i0}, ..., E_{im}) for 1 ≤ i ≤ t, and suppose that we could find a factorization
N = N_1 N_2 when x ≡ X_1^{a_1} ... X_t^{a_t} and y ≡ (-1)^{e_0/2} p_1^{e_1/2} ... p_m^{e_m/2} (modulo N), where
e_j = a_1 E_{1j} + ... + a_t E_{tj} is even for all j. Then x ≡ ±y (modulo N_1) and x ≡ ∓y
(modulo N_2). It is not difficult to see that this solution can be transformed into a pair
(x, y) that appears in step F5, by a series of steps that systematically replace (x, y) by
(xx', yy') where x' ≡ ±y' (modulo N).]
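
Steps F1-F5 can be sketched as a function that consumes the outputs of Algorithm E from any iterable. This is an illustration under stated assumptions, not the book's program: each output is taken to be a pair (x, [e_0, ..., e_m]) with x^2 ≡ (-1)^{e_0} p_1^{e_1} ... p_m^{e_m} (modulo N), the value y of step F5 is taken to be ((-1)^{e_0/2} p_1^{e_1/2} ... p_m^{e_m/2}) mod N as in the proof, and the function and variable names are hypothetical.

```python
from math import gcd


def algorithm_f(N, primes, outputs):
    """Combine outputs (x, [e_0, ..., e_m]); return a proper factor of N or None."""
    m = len(primes)                        # primes = [p_1, ..., p_m]
    b = [-1] * (m + 1)                     # F1
    saved_x, saved_e = [], []              # the x_j and the rows E_j
    for x, e in outputs:                   # F2
        x, e = x % N, list(e)
        k = 0
        while k <= m:                      # F3
            if e[k] % 2 == 0:
                k += 1
            elif b[k] >= 0:                # F4: fold in the stored solution
                i = b[k]
                x = x * saved_x[i] % N
                e = [er + eir for er, eir in zip(e, saved_e[i])]
                k += 1
            else:                          # F4: new independent solution
                b[k] = len(saved_x)
                saved_x.append(x)
                saved_e.append(e)
                break
        else:                              # F5: all exponents are even
            y = (-1) ** (e[0] // 2)
            for p, ep in zip(primes, e[1:]):
                y = y * pow(p, ep // 2, N) % N
            y %= N
            if x != y and x + y != N:
                return gcd(x - y, N)
    return None
```

With N = 77 and the outputs 29^2 ≡ -2·3 and 30^2 ≡ -2^3·3, the second output is combined with the first to give x = 23, y = 65 and the factor gcd(23 - 65, 77) = 7.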

13. There are 2^d values of x having the same exponents (e_0, ..., e_m), since we can
choose the sign of x modulo q_i^{f_i} arbitrarily when N = q_1^{f_1} ... q_d^{f_d}. Exactly two of these
2^d values will fail to yield a factor.

14. Since P^2 ≡ kNQ^2 (modulo p) for any prime divisor p of V, we have 1 ≡
P^{2(p-1)/2} ≡ (kNQ^2)^{(p-1)/2} ≡ (kN)^{(p-1)/2} (modulo p), if P ≢ 0.

15. U_n = (a^n - b^n)/√D, where a = (P + √D)/2, b = (P - √D)/2, D = P^2 - 4Q. Then
2^{n-1} U_n = Σ_k C(n, 2k+1) P^{n-2k-1} D^k; so U_p ≡ D^{(p-1)/2} (modulo p) if p is an odd prime.
Similarly, if V_n = a^n + b^n = U_{n+1} - Q U_{n-1}, then 2^{n-1} V_n = Σ_k C(n, 2k) P^{n-2k} D^k,
and V_p ≡ P^p ≡ P. Thus if U_p ≡ -1, we find that U_{p+1} mod p = 0. If U_p ≡ 1, we
find that (Q U_{p-1}) mod p = 0; here if Q is a multiple of p, U_n ≡ P^{n-1} (modulo p) for
n > 0, so U_n is never a multiple of p; if Q is not a multiple of p, U_{p-1} mod p = 0.
Therefore as in Theorem L, U_t mod N = 0 if N = p_1^{e_1} ... p_r^{e_r}, gcd(N, Q) = 1, and
t = lcm_{1≤j≤r} (p_j^{e_j - 1} (p_j + ε_j)). Under the assumptions of this exercise, the rank of
apparition of N is N + 1; hence N is prime to Q and t is a multiple of N + 1. Also,
the assumptions of this exercise imply that each p_j is odd and each ε_j is ±1, so t ≤
2^{1-r} Π p_j^{e_j - 1} (p_j + p_j/3) = 2(2/3)^r N; hence r = 1 and t = p_1^{e_1} + ε_1 p_1^{e_1 - 1}. Finally,
therefore, e_1 = 1 and ε_1 = 1.

Note: If this test for primality is to be any good, we must choose P and Q in
such a way that the test will probably work. Lehmer suggests taking P = 1 so that
D = 1 - 4Q, and choosing Q so that gcd(N, QD) = 1. (If the latter condition fails,
we know already that N is not prime, unless |QD| ≥ N.) Furthermore, the derivation
above shows that we will want ε_1 = 1, that is, D^{(N-1)/2} ≡ -1 (modulo N). This
is another condition that determines the choice of Q. Furthermore, if D satisfies this
condition, and if U_{N+1} mod N ≠ 0, we know that N is not prime.

Example: If P = 1 and Q = -1, we have the Fibonacci sequence, with D = 5.
Since 5^{11} ≡ -1 (modulo 23), we might attempt to prove that 23 is prime by using the
Fibonacci sequence:

    (F_n mod 23) = 0, 1, 1, 2, 3, 5, 8, 13, 21, 11, 9, 20, 6, 3, 9, 12, 21, 10, 8, 18, 3, 21, 1, 22, 0, ...,

so 24 is the rank of apparition of 23 and the test works. However, the Fibonacci
sequence cannot be used in this way to prove the primality of 13 or 17, since F_7 mod
13 = 0 and F_9 mod 17 = 0. When p ≡ ±1 (modulo 10), we have 5^{(p-1)/2} mod p = 1,
so F_{p-1} (not F_{p+1}) is divisible by p.
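
The rank of apparition in the example can be found by simply running the recurrence modulo N. The sketch assumes U_0 = 0, U_1 = 1, U_{n+1} = P U_n - Q U_{n-1} (which gives the Fibonacci numbers for P = 1, Q = -1); the function name and the search limit are illustrative choices.

```python
def rank_of_apparition(N, P=1, Q=-1, limit=None):
    """Least t > 0 with U_t mod N == 0, or None if none is found up to limit."""
    if limit is None:
        limit = N * N                # generous bound on the period modulo N
    prev, cur = 0, 1 % N             # U_0 and U_1
    for t in range(1, limit + 1):
        if cur == 0:
            return t
        prev, cur = cur, (P * cur - Q * prev) % N
    return None
```

The value 24 = 23 + 1 is what the test needs; for 13 and 17 the ranks 7 and 9 are too small to prove anything.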

17. Let f(q) = 2 lg q - 1. When q = 2 or 3, the tree has at most f(q) nodes. When
q > 3 is prime, let q = 1 + q_1 ... q_t where t ≥ 2 and q_1, ..., q_t are prime. The size of
the tree is ≤ 1 + Σ f(q_k) = 2 + f(q - 1) - t < f(q). [SIAM J. Computing 4 (1975),
214-220.]

18. x(G(a ) — F’(a)) is the number of n < x whose second-largest prime factor is < x a 
and whose largest prime factor is > x a . Hence 

xG'{t) dt = (ft(x t+dt ) — nix*)) ■ x 1 ~ t (G(t/{ 1 — £)) — F(t/( 1 — i))). 

The probability that p t —\ < \fpt is F(t/ 2(1 — t))t~ x dt. [Curiously, it can be 

shown that this also equals f* F(t/(1 — t))dt, the average value of log pt /log x, and it 
also equals Golomb’s constant X of exercises 1.3.3-23 and 3.1-13. The derivative G'(0) 

can be shown to equal f* F(t/( 1 — t))t ~ 2 dt ~ F( 1) 2 F(J) + 3 F(J) + • • • = e 1 . 

The third-largest prime factor has H(o) = f*(H(t/( 1 — t)) — G(t/( 1 — £)))£ —1 dt and 
H'{ 0) = 00 . See P. Billingsley, Period. Math. Hungar. 2 (1972), 283-289; J. Galambos, 
Acta Arith. 31 (1976), 213-218; D. E. Knuth and L. Trabb Pardo, Theoretical Comp. 
Sci. 3 (1976), 321-348.] 

19. M = 2^D - 1 is a multiple of all p for which the order of 2 modulo p divides D.
To extend this idea, let a_1 = 2 and a_{j+1} = a_j^{q_j} mod N, where q_j = p_j^{e_j}, p_j is the jth
prime, and e_j = ⌊log 1000/log p_j⌋; let A = a_169. Now compute b_q = gcd(A^q - 1, N)
for all primes q between 10^3 and 10^5. One way to do this is to start with A^{1009} mod N
and then to multiply alternately by A^4 mod N and A^2 mod N. (A similar method was
used in the 1920s by D. N. Lehmer, but he didn't publish it.) As with Algorithm B we
can avoid most of the gcd's by batching; e.g., since b_{30r-k} = gcd(A^{30r} - A^k, N), we
might try batches of 8, computing c_r = (A^{30r} - A^{29})(A^{30r} - A^{23}) ... (A^{30r} - A) mod N,
then gcd(c_r, N) for 33 < r ≤ 3334.
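
The first phase of this method (the computation of A = a_169) can be sketched as follows. The second phase over the primes q between 10^3 and 10^5 is not included, and the function names and the `bound` parameter are assumptions made for illustration.

```python
from math import gcd


def primes_below(limit):
    """All primes less than limit, by a simple sieve."""
    sieve = bytearray([1]) * limit
    sieve[0:2] = b"\x00\x00"
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = bytearray(len(range(i * i, limit, i)))
    return [i for i in range(limit) if sieve[i]]


def first_phase(N, bound=1000):
    """Return (A, gcd(A - 1, N)), where A = 2^(product of prime powers <= bound) mod N."""
    a = 2
    for p in primes_below(bound):
        q = p
        while q * p <= bound:        # q = p^e with e = floor(log bound / log p)
            q *= p
        a = pow(a, q, N)
    return a, gcd(a - 1, N)
```

For N = 2311 · 10007 the first phase already succeeds: 2310 = 2·3·5·7·11 divides the exponent, whereas 10006 = 2·5003 does not, so gcd(A - 1, N) = 2311.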

22. Algorithm P fails only when the random number x does not reveal the fact that
n is nonprime. Say x is bad if x^q mod n = 1 or if one of the numbers x^{2^j q} is ≡ -1
(modulo n) for 0 ≤ j < k. Since 1 is bad, we have p_n = (b_n - 1)/(n - 2) < b_n/(n - 1),
where b_n is the number of bad x such that 1 ≤ x < n, when n is not prime.

Every bad x satisfies x^{n-1} ≡ 1 (modulo n). When p is prime, the number of
solutions to the congruence x^q ≡ 1 (modulo p^e) for 1 ≤ x ≤ p^e is the number
of solutions of qy ≡ 0 (modulo p^{e-1}(p - 1)) for 0 ≤ y < p^{e-1}(p - 1), namely
gcd(q, p^{e-1}(p - 1)), since we may replace x by a^y where a is a primitive root.

Let n = n_1^{e_1} ... n_r^{e_r}, where the n_i are distinct primes. According to the Chinese
remainder theorem, the number of solutions to the congruence x^{n-1} ≡ 1 (modulo n)
is Π_{1≤i≤r} gcd(n - 1, n_i^{e_i - 1}(n_i - 1)), and this is at most Π_{1≤i≤r} (n_i - 1) since n_i is
relatively prime to n - 1. If some e_i > 1, we have n_i - 1 ≤ (2/9) n_i^{e_i}, hence the number
of solutions is at most (2/9)n; in this case b_n ≤ (2/9)n ≤ (1/4)(n - 1), since n ≥ 9.

Therefore we may assume that n is the product n_1 ... n_r of distinct primes. Let
n_i = 1 + 2^{k_i} q_i, where k_1 ≤ ... ≤ k_r. Then gcd(n - 1, n_i - 1) = 2^{k'_i} q'_i, where
k'_i = min(k, k_i) and q'_i = gcd(q, q_i). Modulo n_i, the number of x such that x^q ≡ 1
is q'_i, and the number of x such that x^{2^j q} ≡ -1 is 2^j q'_i for 0 ≤ j < k'_i, otherwise 0.
Since k ≥ k_1, we have b_n = q'_1 ... q'_r (1 + Σ_{0≤j<k_1} 2^{jr}).

To complete the proof, it suffices to show that b_n ≤ (1/4) q_1 ... q_r 2^{k_1 + ... + k_r} = (1/4)φ(n),
since φ(n) < n - 1. We have

    (1 + Σ_{0≤j<k_1} 2^{jr})/2^{k_1 + ... + k_r} ≤ (1 + Σ_{0≤j<k_1} 2^{jr})/2^{k_1 r}
        = 1/(2^r - 1) + (2^r - 2)/(2^{k_1 r}(2^r - 1)) ≤ 1/2^{r-1},

so the result follows unless r = 2 and k_1 = k_2. If r = 2, exercise 9 shows that n - 1
is not a multiple of both n_1 - 1 and n_2 - 1. Thus if k_1 = k_2 we cannot have both
q'_1 = q_1 and q'_2 = q_2; it follows that q'_1 q'_2 ≤ (1/3) q_1 q_2 and b_n ≤ (1/6)φ(n) in this case.

[Reference: J. Number Theory (1980), to appear. The above proof shows that p_n is
near 1/4 in only two cases, when n = (1 + 2q_1)(1 + 4q_1) or (1 + 2q_1)(1 + 2q_2)(1 + 2q_3). For
example, when n = 49939 · 99877 we have b_n = (1/4)(49938 · 99876) and p_n ≈ .2499925.
See the next answer for further remarks.]
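
For small n the number b_n of bad x can be counted directly. The sketch below assumes the definition of "bad" used in this answer (x^q mod n = 1, or x^{2^j q} ≡ -1 for some 0 ≤ j < k, where n = 1 + 2^k q with q odd); the function names are illustrative.

```python
def is_bad(x, n):
    """True if x fails to reveal that the odd number n = 1 + 2^k q is composite."""
    q, k = n - 1, 0
    while q % 2 == 0:
        q //= 2
        k += 1
    y = pow(x, q, n)
    if y == 1:
        return True
    for _ in range(k):               # y runs through x^(2^j q) for j = 0, ..., k-1
        if y == n - 1:
            return True
        y = y * y % n
    return False


def count_bad(n):
    """b_n, the number of bad x with 1 <= x < n."""
    return sum(is_bad(x, n) for x in range(1, n))
```

The smallest instance of the first extreme case is n = 91 = (1 + 2·3)(1 + 4·3), where the count is 18 = φ(91)/4.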

23. (a) The proofs are simple except perhaps for the reciprocity law. Let p = p_1 ... p_s
and q = q_1 ... q_r, where the p_i and q_j are prime. Then, writing (a/b) for the Jacobi symbol,

    (p/q) = Π_{i,j} (p_i/q_j) = Π_{i,j} (-1)^{(p_i - 1)(q_j - 1)/4} (q_j/p_i) = (-1)^{Σ_{i,j} (p_i - 1)(q_j - 1)/4} (q/p),

so we need only verify that Σ_{i,j} (p_i - 1)(q_j - 1)/4 ≡ (p - 1)(q - 1)/4 (modulo 2). But
Σ_{i,j} (p_i - 1)(q_j - 1)/4 = (Σ_i (p_i - 1)/2)(Σ_j (q_j - 1)/2) is odd iff an odd number of the
p_i and an odd number of the q_j are ≡ 3 (modulo 4), and this holds iff (p - 1)(q - 1)/4
is odd.

(b) As in exercise 22, we may assume that n — ni... n r where the n* = 1 2 kl qi 

are distinct primes, and fci < • • • < fc r ; we let gcd(n — 1, m — 1) — 2 ki q'i and we call x 
bad if it falsely makes n look prime. Let n n = fli< r < r 9* 2 min(fci,fc— be the number 

of solutions of x (n—1)/2 = 1. The number of bad x with (^) — 1 is n n , times an extra 
factor of ^ if k\ < k. (This factor £ is needed to ensure that (^) = —1 for an even 
number of the rii with ki < k.) The number of bad x with (~) = —1 is Il n if ki = k, 
otherwise 0. [If x^ -1 ^ 2 = —1 (modulo rii), we have (■£-) = —1 if ki — k, (£-) — +1 
if ki > k, and a contradiction if ki < k. If k\ = k, there are an odd number of ki 
equal to k.} 

Notes: The probability of a bad guess is > \ only if n is a Carmichael number with 
k r < k) for example, n — 7 • 13 • 19 — 1729, a number made famous by Ramanujan in 
another context. Louis Monier has extended the above analyses to obtain the following 
closed formulas for the number of bad x in general: 



1 <i<r 


Here b' n is the number of bad x in this exercise, and 6 n is either 2 (if k\ — k), or ^ (if 
ki < k and ei is odd for some i), or 1 (otherwise). 

(c) If re 9 modn = 1, then 1 = (^) = (^) 9 = (^). If x 2jq = —1 (modulo n), then 
the order of x modulo rii must be an odd multiple of 2 J+1 for all prime divisors 
of n. Let n — nl 1 .. .n* r and m = 1 -\- 2 then (~) — (—l) 9 ^, so (-) = +1 or 
—1 according as ^ 'I is even or odd. Since n = (l 2 J+1 ^ aq") (modulo 2 J “ t_2 ), 
the sum e t q" is odd iff j -|- 1 = k. [Univ. de Paris-Sud, Lab. Rech. Informatique, 
Rapport 20 (1978).] 

24. Let Mi be a matrix having one row for each nonprime odd number in the range 

1 < n < N and having N — 1 columns numbered from 2 to N; the entry in row n 
column x is 1 if n passes the x test of Algorithm P, otherwise it is zero. When N — 
qn + r and 0 < r < n, we know that row n contains at most \qn -\- minQn, r) — 
\N + minQn — \r, |r) < \N -f entries equal to 0, so at least half of the 

entries in the matrix are 1. Thus, some column xi of Mi has at least half of its entries 
equal to 1. Removing column Xi and all rows in which this column contains 1 leaves a 
matrix M 2 having similar properties; a repetition of this construction produces matrix 
M r with N — r columns and fewer than N/2 r rows, and with at least %{N — 1) entries 
per row equal to 1. [Cf. Proc. IEEE Symp. Foundations of Comp. Sci. 19 (1978), 78.] 

[A similar proof implies the existence of a single infinite sequence Xi < X 2 < • • • 
such that the number n > 1 is prime if and only if it passes the x test of Algorithm P 
for x = xi, ..., x = Xm, where m = ^ [lgnj(Llg?aj — l). Does there exist a sequence 
xi < X 2 < • ■ • having this property but with m — O(logn)?] 

25. This theorem was first proved rigorously by von Mangoldt [J. für die reine und
angew. Math. 114 (1895), 255-305], who showed in fact that the O(1) term is equal
to (1/2)[x is prime] + constant + ∫_x^∞ dt/((t^2 - 1) t ln t). The constant is li 2 - ln 2 = γ +
ln ln 2 + Σ_{n≥2} (ln 2)^n/(n·n!) = 0.35201 65995 57547 47542 73567 67736 43656 84471+.

26. If N is not prime, it has a prime factor $q \le \sqrt N$. By hypothesis, every prime 
divisor p of f has an integer $x_p$ such that the order of $x_p$ modulo q is a divisor of 
N − 1 but not of (N − 1)/p. Therefore if $p^k$ divides f, the order of $x_p$ modulo q is a 
multiple of $p^k$. Exercise 3.2.1.2-15 now tells us that there is an element x of order f 
modulo q. But this is impossible, since it implies that $q^2 \ge (f+1)^2 \ge (f+1)r \ge N$, 
and equality cannot hold. [Proc. Camb. Phil. Soc. 18 (1914), 29-30.] 

27. If k is not divisible by 3 and if $k < 2^n + 1$, the number $k\cdot2^n + 1$ is prime iff 
$3^{2^{n-1}k} \equiv -1$ (modulo $k\cdot2^n + 1$). For if this condition holds, $k\cdot2^n + 1$ is prime by 
exercise 26; and if $k\cdot2^n + 1$ is prime, the number 3 is a quadratic nonresidue mod 
$k\cdot2^n + 1$ by the law of quadratic reciprocity, since $(k\cdot2^n + 1) \bmod 12 = 5$. [This test 
was stated without proof by Proth in Comptes Rendus 87 (Paris, 1878), 926.] 
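
The criterion of answer 27 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are invented here, the check `n >= 2` is added because the "mod 12 = 5" step of the argument needs it, and trial division is included purely as a cross-check.

```python
def proth_is_prime(k: int, n: int) -> bool:
    """Proth's criterion for N = k*2^n + 1.

    Assumes k is not divisible by 3, n >= 2, and 0 < k < 2^n + 1.
    """
    if k % 3 == 0 or n < 2 or not (0 < k < 2**n + 1):
        raise ValueError("hypotheses of the test are not satisfied")
    N = k * 2**n + 1
    # 3^(2^(n-1) * k) = 3^((N-1)/2); N is prime iff this is congruent to -1.
    return pow(3, k * 2**(n - 1), N) == N - 1


def is_prime_naive(N: int) -> bool:
    """Trial division, used here only to cross-check small cases."""
    if N < 2:
        return False
    d = 2
    while d * d <= N:
        if N % d == 0:
            return False
        d += 1
    return True
```

For instance, `proth_is_prime(5, 7)` examines 5·2^7 + 1 = 641, and `proth_is_prime(1, 3)` examines 9.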

To implement Proth’s test with the necessary efficiency, we need to be able to 
compute $x^2 \bmod (k\cdot2^n + 1)$ with about the same speed as we can compute the quantity 
$x^2 \bmod (2^n - 1)$. Let $x^2 = A\cdot2^n + B$ and let $A = qk + r$ where $0 \le r < k$; 
we can determine q and r rapidly, when k is a single-precision number. Then 
$x^2 \equiv B - q + r\cdot2^n$ (modulo $k\cdot2^n + 1$), so the remainder is easily obtained. 
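
The reduction just described can be tried out as follows. This is a sketch, not the book's program: the function name is hypothetical, Python's unbounded integers stand in for multiple-precision arithmetic, and the final correction step relies on the bound q ≤ k·2^n, which holds when 0 ≤ x ≤ k·2^n.

```python
def square_mod_proth(x: int, k: int, n: int) -> int:
    """Return x*x mod (k*2^n + 1) for 0 <= x <= k*2^n.

    Only shifts, a mask, and one division by the small number k are used.
    """
    N = (k << n) + 1
    s = x * x
    A, B = s >> n, s & ((1 << n) - 1)   # x^2 = A*2^n + B
    q, r = divmod(A, k)                 # A = q*k + r with 0 <= r < k
    result = B - q + (r << n)           # congruent to x^2 modulo N
    if result < 0:                      # result > -N, so one correction suffices
        result += N
    return result
```

The value B + r·2^n never exceeds k·2^n − 1, so no subtraction of N is needed at the top end.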

[To test numbers of the form $3\cdot2^n + 1$ for primality, the job is only slightly 
more difficult; we first try random single-precision numbers until finding one that is a 
quadratic nonresidue mod $3\cdot2^n + 1$ by the law of quadratic reciprocity, then use this 
number in place of “3” in the above test. If $n \bmod 4 \ne 0$, the number 5 can be used. It 
turns out that $3\cdot2^n + 1$ is prime when n = 1, 2, 5, 6, 8, 12, 18, 30, 36, 41, 66, 189, 201, 
209, 276, 353, 408, 438, 534, 2208, 2816, 3168, 3189, 3912; and $5\cdot2^n + 1$ is prime when 
n = 1, 3, 7, 13, 15, 25, 39, 55, 75, 85, 127, 1947. See R. M. Robinson, Proc. Amer. 
Math. Soc. 9 (1958), 673-681; the additional numbers listed here were found by M. F. 
Plass.] 

28. f(p,p 2 d) — 2/(p -f 1) -f" f(p,d)/p, since l/(p 1) is the probability that A is 

a multiple of p. f(p,pd) = 1 /(p -f- 1) when dmodp ^ 0. /(2,4fc -j- 3) = ^ since 

A 2 — (4k -)- 3 )B 2 cannot be a multiple of 4; /(2, 8 /c -f- 5) = § since A 2 — (8k -|- 5 )B 2 
cannot be a multiple of 8 ; /( 2 , 8k —|— 1 ) = 5 -j- 5 —)- ^ 5 4 " * ‘ ‘ == 5 - f(Pi ^0 := 

( 2 p/(p 2 — 1 ), 0 ) if d (p—1)/2 modp = (1 ,p — 1 ), respectively, for odd p. 

29. The number of solutions to the equation a?i -)- \-x m < r in nonnegative integers 

Xi is ( m ^ r ) > m r fr\, and each of these corresponds to a unique integer p\ l ... p ^ 71 < n. 
[For sharper estimates, in the special case that p 3 is the j th prime for all j, see N. G. 
de Bruijn, Indag. Math. 28 (1966), 240-247; H. Halberstam, Proc. London Math. Soc. 
(3) 21 (1970), 102-107.] 

30. If p\ l ...p e m = x 2 (modulo q t ), we can find yi such that p^L.-p^p = (ip *) 2 
(modulo q^), hence by the Chinese remainder theorem we obtain 2 d values of X such 
that X 2 = pi 1 ... p\^ (modulo N). Such (ei,..., e m ) correspond to at most ( r / 2 ) pairs 
(e[,..., e' m ; e”, ..., e^) having the hinted properties. Now for each of the 2 d binary 
numbers a — (ai... 0 , 4 ) 2 , let n a be the number of exponents (ei,...,e^) such that 

(pj 1 .. .p^) (92— 1)/2 = (—l) flt (modulo qi); we have proved that the required number 
of integers X is > 2 d (£ a nJ)/( r / 2 )- Since £ o n a is the number of ways to choose at 

most r/2 objects from a set of m objects with repetitions permitted, namely ( m ^/ 2 /2 )’ 
we have Y^ a n * — ( m ^/ 2 /2 ) 2 / 2<i ^ m r /(2 d (r/2)!). [Cf. J. Number Theory, to appear.] 

31. Set r = M, qM = 4m, eM = 2m. 

32. Let M — and let the places x% of each message be restricted to the range 

0 < x < M 3 — M 2 . If x > M, encode it as x 3 modA^ as before, but if x < M 
change the encoding to ( x -f- yM) 3 mod IV, where y is a random number in the range 
M 2 — M < y < M 2 . To decode, first take the cube root; and if the result is M 3 — M 2 
or more, take the remainder mod M. 

34. Let P be the probability that $x^m \bmod p = 1$ and let Q be the probability that 
$x^m \bmod q = 1$. The probability that $\gcd(x^m - 1, N) = p$ or q is P(1 − Q) + Q(1 − P) = 
P + Q − 2PQ. If $P \le \frac12$ or $Q \le \frac12$, this probability is $\ge \frac12\max(P, Q) \ge \frac12 10^{-6}$, so 
we have a good chance of finding a factor after about $10^6 \log m$ arithmetic operations 
modulo N. On the other hand if $P > \frac12$ and $Q > \frac12$ then $P \approx Q \approx 1$, since we have 
the general formula P = gcd(m, p − 1)/p; thus m is a multiple of lcm(p − 1, q − 1) in 
this case. Let $m = 2^k r$ where r is odd, and form the sequence $x^r \bmod N$, $x^{2r} \bmod N$, 
..., $x^{2^k r} \bmod N$; with high probability we find that the first appearance of 1 is preceded 
by a value y other than N − 1, as in Algorithm P, hence gcd(y − 1, N) = p or q. 
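
The second case of answer 34 (m a multiple of lcm(p − 1, q − 1)) translates directly into code. The sketch below is illustrative only: the function name and the choice of trying the bases 2, 3, 4, ... in order (instead of random x) are assumptions made here so that the example is deterministic.

```python
from math import gcd


def factor_from_exponent(N: int, m: int, bases=range(2, 100)):
    """Try to split N = p*q, given m that is a multiple of lcm(p-1, q-1).

    Writes m = 2^k * r with r odd and scans x^r, x^(2r), ..., x^(2^k r) mod N
    for a value y != N-1 that immediately precedes the first 1.
    """
    r, k = m, 0
    while r % 2 == 0:
        r //= 2
        k += 1
    for x in bases:
        g = gcd(x, N)
        if 1 < g < N:                 # lucky: x already shares a factor with N
            return g
        y = pow(x, r, N)
        for _ in range(k):
            z = y * y % N
            if z == 1:
                if y != 1 and y != N - 1:
                    return gcd(y - 1, N)   # y^2 = 1 but y != +-1 (mod N)
                break                       # this x gives no information
            y = z
    return None                       # no factor found with the bases tried
```

For example, with N = 61·53 = 3233 and m = lcm(60, 52) = 780, the call `factor_from_exponent(3233, 780)` is expected to return one of the two prime factors.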

35. Let f — [p q ~ 1 — x )mod N. Since pmod4 — qmod4 = 3, we have (^r) = 

(=!) = (^) = — (^) = —1, and we also have (J) = —(J) = 1 . Given a message x in 
the range 0 < x < |(AT — 5), let x = Ax + 2 or 8 x + 4, whichever satisfies (^) = 1; 
then transmit the message x 2 mod N. 

To decode this message, we first use a SQRT box to find the unique number y 
such that y 2 = x 2 mod N and (^) = 1 and y is even. Then y = x, since the other 
three square roots of x 2 are N — x and (±fx) mod JV; the first of these is odd, and 
the other two have negative Jacobi symbols. The decoding is now completed by setting 
x = \_y/A\ if ?/mod4 = 2, otherwise x +- [y/ 8 j. 

Anybody who can decode such encodings can also find the factors of N, because 
the decoding of a false message x 2 modN when (^) = —1 reveals (±/) mod TV, a 
number that has a nontrivial gcd with N. 

36. The rath prime equals 

ralnra -|- rain In ra — ra -f- ralnlnra/lnra — 2 ra/lnra -f- 0 (ra(loglogra) 2 (logra) -2 ), 

by (4), although for this problem we need only the weaker estimate p m = ra In ra + 
O(raloglogra). (We will assume that p m is the rath prime, since this corresponds to the 
assumption that V is uniformly distributed.) If we choose lnra = ^cy/lnNln In N, 
where c = 0(1), we find that r — c~ 1 ^/ ln N /In In N — c~ 2 — c ~ 2 (lnlnln AT/lnln AT)— 
2c —2 (ln \ c)/ln In N -f* 0(\J In \nN/\nN ). The estimated running time (21) now simp¬ 
lifies somewhat surprisingly to the formula exp(/(c, N)\f\n ATln In AT -j- 0(loglog N)), 
where we have /(c, N) = c —(— (l —(1 +ln 2)/lnln N)c ~The value of c that minimizes 
/(c, N) is y/l — (1 + In 2)/ln In IV, so we obtain the estimate 

exp(2\/ln W In In Ny/l — (1 + ln2)/lnln N -(- 0(log log AT)). 

When N — 10 50 this gives e(N) ^ .33, which is still much larger than the observed 
behavior. 

Note: The partial quotients of $\sqrt D$ seem to behave according to the distribution 
obtained for random real numbers in Section 4.5.3. For example, the first million partial 
quotients of the number $\sqrt{10^{18} + 314159}$ include exactly (415236, 169719, 93180, 58606) 
cases where $A_n$ is respectively (1, 2, 3, 4). Since $V_n$ lies between $(\sqrt D + 1)/(A_n + 1)$ and 
$2\sqrt D/A_n$, it is likely that $V_n < 2\sqrt D\,y$ with probability about lg(1 + y). This is not 
much different from a uniform distribution, so something besides the size of $V_n$ must 
account for the unreasonable effectiveness of Algorithm E. 

37. Apply exercise 4.5.3-12 to the number $\sqrt D + R$, to see that the periodic part 
begins immediately, and run the period backwards to verify the palindromic property. 
[It follows that the second half of the period gives the same V’s as the first, and 
Algorithm E could be shut down earlier by terminating it when U = U' or V = V' in 
step E5. However, the period is generally so long, we never even get close to halfway 
through it, so there is no point in making the algorithm more complicated.] 

38. [Inf. Proc. Letters 8 (1979), 28-31.] Note that $x \bmod y = x - y\lfloor x/y\rfloor$ can be 
computed easily on such a machine, and we can get simple constants like 0 = x − x, 
$1 = \lfloor x/x\rfloor$, 2 = 1 + 1; we can test x > 0 by testing whether x = 1 or $\lfloor x/(x-1)\rfloor \ne 0$. 

(a) First compute $l = \lfloor\lg n\rfloor$ in O(log n) steps, by repeatedly dividing by 2; at the 
same time compute $k = 2^l$ and $A \leftarrow 2^{2^{l+1}}$ in O(log n) steps by repeatedly setting 
$k \leftarrow 2k$, $A \leftarrow A^2$. For the main computation, suppose we know that $t = A^m$, 
$u = (A+1)^m$, and $v = m!$; then we can increase the value of m by 1 by setting 
$m \leftarrow m + 1$, $t \leftarrow At$, $u \leftarrow (A+1)u$, $v \leftarrow vm$; and we can double the value of m by 
setting $m \leftarrow 2m$, $u \leftarrow u^2$, $v \leftarrow (\lfloor u/t\rfloor \bmod A)v^2$, $t \leftarrow t^2$, provided that A is sufficiently 
large. (Consider the number u in radix-A notation; A must be greater than $\binom{2m}{m}$.) 

Now if $n = (a_l \ldots a_0)_2$, let $n_j = (a_l \ldots a_j)_2$; if $m = n_j$ and $k = 2^j$ and j > 0 we can 
decrease j by 1 by setting $k \leftarrow \lfloor k/2\rfloor$, $m \leftarrow 2m + (\lfloor n/k\rfloor \bmod 2)$. Hence we can compute 
$n_j!$ for j = l, l − 1, ..., 0 in O(log n) steps. [Another solution, due to Julia Robinson, 
is to compute $n! = \lfloor B^n/\binom{B}{n}\rfloor$ when $B > (2n)^{n+1}$; cf. AMM 80 (1973), 250-251, 266.] 
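
Both methods of part (a) can be checked numerically. The following sketch makes some assumptions that should be read as such: the function names are invented, A is taken to be 2^(2k) with k = 2^⌊lg n⌋ (large enough to exceed every central binomial coefficient that arises), and Robinson's formula is taken in the form n! = ⌊B^n / C(B, n)⌋ with B = (2n)^(n+1) + 1.

```python
from math import comb


def factorial_by_doubling(n: int) -> int:
    """Compute n! (n >= 1) with O(log n) additions, multiplications and divisions."""
    if n < 1:
        raise ValueError("n must be positive")
    k, A = 1, 4                      # k = 2^l and A = 2^(2k), starting with l = 0
    while 2 * k <= n:
        k, A = 2 * k, A * A
    m, t, u, v = 1, A, A + 1, 1      # invariants: t = A^m, u = (A+1)^m, v = m!
    while k > 1:
        k //= 2
        # Double m: the radix-A digit of (A+1)^(2m) at position m is C(2m, m).
        m = 2 * m
        u = u * u
        v = ((u // t) % A) * v * v
        t = t * t
        if (n // k) % 2:             # next binary digit of n is 1: increase m by 1
            m += 1
            t = A * t
            u = (A + 1) * u
            v = v * m
    return v


def factorial_robinson(n: int) -> int:
    """Robinson-style formula: floor(B^n / C(B, n)) for a sufficiently large B."""
    B = (2 * n) ** (n + 1) + 1
    return B**n // comb(B, n)
```

Both functions return 120 for n = 5; the first one passes through m = 1, 2, 4, 5.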

(b) First compute A = 2 2 as in (a), then find the least k > 0 such that 
2 fc+1 !modn = 0. If gcd(n, 2 fc !) 7 ^ 1, let f(n ) be this value; note that this gcd can 
be computed in O(logn) steps by Euclid’s algorithm. Otherwise we will find the least 
integer m such that (^ 2 j) modn = 0, and let f(n) = gcd(m, n). (Note that in this 
case 2 k < m < 2 fc+1 , hence [m/ 2 ] < 2 k and [m/ 2 ]! is relatively prime to n; therefore 
{[rn/ 2 ]} m °dn ~ 0 iff mlinodn = 0. Furthermore n 7 ^ 4.) 

To compute m with a bounded number of registers, we can use Fibonacci numbers 
(cf. Algorithm 6 .2.IF). Suppose we know that s = F 0 , s' = Fj+ 1 , t = A Fj , 
t' = A F i+\ u — {A l) 2 ^, u' = (A + l) 2 F i+\ v = A m , w = {A + l) 2m , 

mod n 7 ^ 0, and — 0. It is easy to reach this state of affairs with 

m — Fj+ 1 , for suitably large j, in O(logn) steps; furthermore A will be larger than 
2 2 ( to+s ). If s = 1, we set f(n) — gcd(2m + 1,n) or gcd(2m + 2,n), whichever is 
7 ^ 1, and terminate the algorithm. Otherwise we reduce j by 1 as follows: Set r 4— s, 
s 4— s' — s, s 4— r, r 4- t, t 4— [t'/t\, t' 4— r, r 4— u, u 4— \u'/u\, u' r; then if 
([wu/vt\ mod A) modn 7 ^ 0 , set m 4— m s, w 4— wu, v 4— vt. 

[Can this problem be solved with fewer than O(log n) operations? Can the smallest, 
or the largest, prime factor of n be computed in O(log n) operations?] 


SECTION 4.6 


1. $9x^2 + 7x + 9$; $5x^3 + 7x^2 + 2x + 6$. 

2. (a) True. (b) False if the algebraic system S contains “zero divisors,” nonzero 
numbers whose product is zero, as in exercise 1; otherwise true. (c) True when m ≠ n, 
but false in general when m = n, since the leading coefficients might cancel. 

3. Assume that r ≤ s. For 0 ≤ k ≤ r the maximum is $m_1m_2(k + 1)$; for r ≤ k ≤ s 
it is $m_1m_2(r + 1)$; for s ≤ k ≤ r + s it is $m_1m_2(r + s + 1 - k)$. The least upper 
bound valid for all k is $m_1m_2(r + 1)$. (The solver of this exercise will know how to 
factor the polynomial $x^7 + 2x^6 + 3x^5 + 3x^4 + 3x^3 + 3x^2 + 2x + 1$.) 

4. If one of the polynomials has fewer than $2^t$ nonzero coefficients, the product can be 
formed by putting exactly t − 1 zeros between each of the coefficients, then multiplying 
in the binary number system, and finally using a logical AND instruction (present on 
most binary computers, cf. Section 4.5.4) to zero out the extra bits. For example, if 
t = 3, the multiplication in the text would become $(1001000001)_2 \times (1000001001)_2 = 
(1001001011001001001)_2$; if we AND this result with the constant $(1001001\ldots1001)_2$, the 
desired answer is obtained. A similar technique can be used to multiply polynomials 
with nonnegative coefficients, when it is known that the coefficients will not be too 
large. 
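
The spacing trick of answer 4 is easy to try with Python integers used as bit vectors (bit i holds the coefficient of x^i). This is a sketch with invented helper names; it is not the text's machine-level program, and the explicit mask stands in for the AND instruction.

```python
def spread(u: int, t: int) -> int:
    """Move bit i of u to bit i*t, i.e. insert t-1 zeros between coefficients."""
    out, i = 0, 0
    while u:
        if u & 1:
            out |= 1 << (i * t)
        u >>= 1
        i += 1
    return out


def compress(w: int, t: int) -> int:
    """Inverse of spread for words whose only nonzero bits are at multiples of t."""
    out, i = 0, 0
    while w:
        if w & 1:
            out |= 1 << i
        w >>= t
        i += 1
    return out


def polymul_mod2(u: int, v: int, t: int) -> int:
    """Multiply two polynomials modulo 2 using one ordinary integer multiplication."""
    if min(bin(u).count("1"), bin(v).count("1")) >= 2**t:
        raise ValueError("one operand needs fewer than 2^t nonzero coefficients")
    product = spread(u, t) * spread(v, t)
    length = u.bit_length() + v.bit_length()
    mask = spread((1 << length) - 1, t)      # the constant (1001001...1001)_2
    return compress(product & mask, t)
```

With t = 3, `polymul_mod2(0b1011, 0b1101, 3)` reproduces the example above: the spread operands are 1000001001 and 1001000001, and the masked product compresses to 1111111.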

5. Polynomials of degree < 2n can be represented as $U_1(x)x^n + U_0(x)$ where $\deg(U_1)$ 
and $\deg(U_0)$ < n; and $(U_1(x)x^n + U_0(x))(V_1(x)x^n + V_0(x)) = U_1(x)V_1(x)(x^{2n} + x^n) + 
(U_1(x) + U_0(x))(V_1(x) + V_0(x))x^n + U_0(x)V_0(x)(x^n + 1)$. (This equation assumes that 
arithmetic is being done modulo 2.) Thus Eqs. 4.3.3-3, 4, 5 hold. 

Notes: S. A. Cook has shown that Algorithm 4.3.3C can be extended in a similar 
way, and exercise 4.6.4-57 describes a method requiring even fewer operations for 
large n. But these ideas are not useful for “sparse” polynomials (having mostly zero 
coefficients). 


SECTION 4.6.1 

1. $q(x) = 1\cdot2^3x^3 + 0\cdot2^2x^2 - 2\cdot2x + 8 = 8x^3 - 4x + 8$; $r(x) = 28x^2 + 4x + 8$. 

2. The monic sequence of polynomials produced during Euclid’s algorithm has the 
coefficients (1, 5, 6, 6, 1, 6, 3), (1, 2, 5, 2, 2, 4, 5), (1, 5, 6, 2, 3, 4), (1, 3, 4, 6), 0. Hence the 
greatest common divisor is $x^3 + 3x^2 + 4x + 6$. (The greatest common divisor of a 
polynomial and its reverse is always symmetric, in the sense that it is a unit multiple 
of its own reverse.) 
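
The computation in answer 2 can be reproduced with a short Euclidean algorithm over the integers modulo a prime p. The sketch below uses hypothetical function names and coefficient lists written with the highest degree first, as in the answer; `pow(c, -1, p)` requires Python 3.8 or later.

```python
def poly_mod(u, v, p):
    """Remainder of u(x) divided by v(x), coefficients mod p, highest degree first."""
    u = [c % p for c in u]
    inv = pow(v[0], -1, p)            # inverse of the leading coefficient of v
    while len(u) >= len(v):
        c = u[0] * inv % p
        for i in range(len(v)):
            u[i] = (u[i] - c * v[i]) % p
        u.pop(0)                      # the leading coefficient is now zero
    while u and u[0] == 0:            # strip any further leading zeros
        u.pop(0)
    return u


def poly_gcd(u, v, p):
    """Monic greatest common divisor of u(x) and v(x) modulo the prime p."""
    u = [c % p for c in u]
    v = [c % p for c in v]
    while v:
        u, v = v, poly_mod(u, v, p)
    inv = pow(u[0], -1, p)
    return [c * inv % p for c in u]
```

Here `poly_gcd([1, 5, 6, 6, 1, 6, 3], [1, 2, 5, 2, 2, 4, 5], 7)` yields the coefficients of x^3 + 3x^2 + 4x + 6.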

3. The procedure of Algorithm 4.5.2X is valid, with polynomials over S substituted 
for integers. When the algorithm terminates, we have $U(x) = u_2(x)$, $V(x) = u_1(x)$. Let 
m = deg(u), n = deg(v). It is easy to prove by induction that $\deg(u_3) + \deg(v_1) = n$, 
$\deg(u_3) + \deg(v_2) = m$, after step X3, throughout the execution of the algorithm, 
provided that m ≥ n. Hence if m and n are greater than d = deg(gcd(u, v)) we have 
deg(U) < m − d, deg(V) < n − d; the exact degrees are $m - d_1$ and $n - d_1$, where $d_1$ 
is the degree of the second-last nonzero remainder. If d = min(m, n), say d = n, we 
have U(x) = 0 and V(x) = 1. 

When $u(x) = x^m - 1$ and $v(x) = x^n - 1$, the identity $(x^m - 1) \bmod (x^n - 1) = 
x^{m \bmod n} - 1$ shows that all polynomials occurring during the calculation are monic, 
with integer coefficients. When $u(x) = x^{21} - 1$ and $v(x) = x^{13} - 1$, we have $V(x) = 
x^{11} + x^8 + x^6 + x^3 + 1$ and $U(x) = -(x^{19} + x^{16} + x^{14} + x^{11} + x^8 + x^6 + x^3 + x)$. 

[See also Eq. 3.3.3-29, which gives an alternative formula for U(x) and V(x).] 
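
As a sanity check on the example with x^21 − 1 and x^13 − 1, one can multiply out U(x)v(x) + V(x)u(x) and confirm that it equals gcd(u, v) = x − 1. The sketch below assumes V(x) = x^11 + x^8 + x^6 + x^3 + 1 (the value forced by the identity) and represents sparse polynomials as dictionaries from exponent to coefficient; the helper names are invented.

```python
from collections import defaultdict


def pmul(a, b):
    """Product of two sparse polynomials given as {exponent: coefficient}."""
    out = defaultdict(int)
    for i, x in a.items():
        for j, y in b.items():
            out[i + j] += x * y
    return {e: c for e, c in out.items() if c}


def padd(a, b):
    """Sum of two sparse polynomials given as {exponent: coefficient}."""
    out = defaultdict(int)
    for poly in (a, b):
        for e, c in poly.items():
            out[e] += c
    return {e: c for e, c in out.items() if c}


u = {21: 1, 0: -1}                                   # x^21 - 1
v = {13: 1, 0: -1}                                   # x^13 - 1
V = {e: 1 for e in (11, 8, 6, 3, 0)}
U = {e: -1 for e in (19, 16, 14, 11, 8, 6, 3, 1)}
combination = padd(pmul(U, v), pmul(V, u))           # U(x)v(x) + V(x)u(x)
```

The degrees 19 = 21 − 2 and 11 = 13 − 2 agree with the rule m − d_1 and n − d_1, since the second-last nonzero remainder is x^2 − 1.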

4. Since the quotient q(x) depends only on v(x) and the first m − n coefficients of 
u(x), the remainder r(x) = u(x) − q(x)v(x) is uniformly distributed and independent of 
v(x). Hence each step of the algorithm may be regarded as independent of the others; 
this algorithm is much more well-behaved than Euclid’s algorithm over the integers. 

The probability that n\ = n — A; is p 1—fc ( 1 — 1/p), and t = 0 with probability 
p~ n . Each succeeding step has essentially the same behavior; hence we can see that 
any given sequence of degrees n, ni, ..., n t , — oo occurs with probability (p — if/p n . 
To find the average value of /(ni,..., n t ), let S t be the sum of f(ni ,..., n t ) over all 
sequences n>ni>--->n t >0 having a given value of t; then the average is 

e, st(p -1 r/ P n . 

Let f(ni,... ,n t ) = t; then St — (")(£ + 1), so the average is n(l — 1 /p). 
Similarly, if /(ni ,... ,n t ) — m -f • • ■ -f- n t , then S t = (^(t— 1 )> and average is 

(J)(l — 1 /p). Finally, if f(n i,...,n t ) = (n —ni)ni -\ - {~{n t -i — n t )n t , then S t = 

(t+ 1) ~ ( n + + { n t 1 )(i). and the average is ("t 1 ) — (n + l)p/(p — 1 ) + 

(p/(p-l)) 2 (l-l/p n+1 ). 

As a consequence we can see that if p is large there is very high probability that 
$n_{j+1} = n_j - 1$ for all j. (If this condition fails over the rational numbers, it fails for 
all p, so we have further evidence for the text’s claim that Algorithm C almost always 
finds $\delta_2 = \delta_3 = \cdots = 1$.) 

5. Using the formulas developed in exercise 4, with $f(n_1, \ldots, n_t) = \delta_{n_t 0}$, we find that 
the probability is 1 − 1/p if n > 0, 1 if n = 0. 

6. Assuming that the constant terms u(0) and v(0) are nonzero, imagine a “right-to-left” 
division algorithm, $u(x) = v(x)q(x) + x^{m-n}r(x)$, where deg(r) < deg(v). We 
obtain a gcd algorithm analogous to Algorithm 4.5.2B, which is essentially Euclid’s 
algorithm applied to the “reverse” of the original inputs (cf. exercise 2), afterwards 
reversing the answer and multiplying by an appropriate power of x. 

7. The units of S (as polynomials of degree zero). 

8. If u(x) = v(x)w(x), where u(x) has integer coefficients while v(x) and w(x) have 
rational coefficients, there are integers m and n such that m · v(x) and n · w(x) have 
integer coefficients. Now u(x) is primitive, so we have 

u(x) = ± pp(m · v(x)) pp(n · w(x)). 

9. We can extend Algorithm E as follows: Let (ui(x), U 2 {x), 113 , U 4 {xj) and [v\{x) ) V 2 {x), 
V 3 ,V 4 (x)) be quadruples that satisfy the relations wi(x)w(x) -(- U 2 (x)v(x) — U 3 U 4 {x), 
v\(x)u(x) -\- V 2 (x)v(x) = V 3 V 4 {x). The extended algorithm starts with the quadruples 
(l, 0 , cont(it), pp(u(x))) and ( 0 , 1 , cont(v), pp(v(x))) and manipulates them in such a way 
as to preserve the above conditions, where U 4 (x) and V 4 (x) run through the same 
sequence as u(x) and v(x) do in Algorithm E. If au 4 (x) — q(x)v 4 {x) -\- br(x), we have 
av 3 (ui(x), u 2 {x)) — q(x)u 3 {vi(x), v 2 (x)) = (ri(x),r 2 (x)), where ri(x)u(x) + r 2 (x)v{x) — 
bu 3 V 3 r(x), so the extended algorithm can preserve the desired relations. If u(x ) and 
v(x) are relatively prime, the extended algorithm eventually finds r{x) of degree zero, 
and we obtain U(x) = r 2 (x), V(x) — 7 r(x) as desired. (In practice we would divide 
it(x), r 2 (x), and bu 3 V 3 by gcd(cont(ri), cont(r 2 )).) Conversely, if such U(x) and V{x) 
exist, then uix) and v(x) have no common prime divisors, since they are primitive and 
have no common divisors of positive degree. 

10. By successively factoring reducible polynomials into polynomials of smaller degree, 
we must obtain a finite factorization of any polynomial into irreducibles. The 
factorization of the content is unique. To show that there is at most one factorization 
of the primitive part, the key result is to prove that if u(x) is an irreducible factor of 
v(x)w(x), but not a unit multiple of the irreducible polynomial v(x), then u(x) is a factor 
of w(x). This can be proved by observing that u(x) is a factor of v(x)w(x)U(x) = 
rw(x) − w(x)u(x)V(x) by the result of exercise 9, where r is a nonzero constant. 

11 . The only row names needed would be A 1; Aq, B 4 , B 3 , B 2 , Bi, B 0 , C\, Co, D 0 . In 
general, let Uj+ 2 (x) = 0; then the rows needed for the proof are A n2 — nj through Aq, 
B ni —nj through Bo, C n2 — nj through Co, D n3 — n i through Do, etc. 

12. If nk = 0, the text’s proof of (24) shows that the value of the determinant is 

it hk, and this equals V IIi<^ ^ ^If the polynomials have a factor 

of positive degree, we can artificially assume that the polynomial zero has degree zero 
and use the same formula with Ik — 0 . 

Notes: The value R(u, v ) of Sylvester’s determinant is called the resultant of u and 
v , and the quantity (—l) deg(u)(deg(u)— 1 )/ 2 4 u) — u ') is called the discriminant of u, 
where u' is the derivative of u. If u(x) has the factored form a(x — a i)...(x — a m ), 
and if v(x) = b(x — (3i)...(x — p n ), the resultant R(u,v) is a n v(oti )... v{a m ) = 
(_1 ) Tnn b m u(p 1 ).. .u(p n ) — a n b m Y[ 1 <t< mil < J -< n (tt» — Pj)- It follows that the poly¬ 
nomials of degree mn in y defined as the respective resultants with v(x) of u(y — x), 
u(y -\- x), x rn u{yjx') ) and u(yx) have as respective roots the sums oti -(- pj, differences 
a t — Pj, products ouPj, and quotients otijpj (when u( 0 ) 7 ^ 0 ). This idea has been used 
by R. G. K. Loos to construct algorithms for arithmetic on algebraic numbers [SIAM 
J. Computing, to appear]. 
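
The product formula for the resultant can be checked on small cases by evaluating Sylvester's determinant directly. The sketch below is illustrative: the function names are invented, the rows are arranged in the common textbook order (n shifted copies of u followed by m shifted copies of v), and exact rational elimination is used instead of any specialized method.

```python
from fractions import Fraction


def sylvester(u, v):
    """Sylvester matrix of u and v (coefficient lists, highest degree first)."""
    m, n = len(u) - 1, len(v) - 1
    rows = [[0] * i + list(u) + [0] * (n - 1 - i) for i in range(n)]
    rows += [[0] * i + list(v) + [0] * (m - 1 - i) for i in range(m)]
    return rows


def det(matrix):
    """Determinant by Gaussian elimination over the rationals."""
    M = [[Fraction(x) for x in row] for row in matrix]
    size, d = len(M), Fraction(1)
    for c in range(size):
        piv = next((r for r in range(c, size) if M[r][c] != 0), None)
        if piv is None:
            return Fraction(0)
        if piv != c:
            M[c], M[piv] = M[piv], M[c]
            d = -d                       # a row interchange flips the sign
        d *= M[c][c]
        for r in range(c + 1, size):
            f = M[r][c] / M[c][c]
            M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return d


def resultant(u, v):
    return det(sylvester(u, v))
```

For u(x) = (x − 1)(x − 2) and v(x) = x − 3 we have a = 1, n = 1, so the formula gives v(1)v(2) = (−2)(−1) = 2, and `resultant([1, -3, 2], [1, -3])` agrees. When u and v share a root, as with x^2 − 1 and x − 1, the resultant is zero.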

If we replace each row A x in Sylvester’s matrix by 

(boAi -(- fc>i •At-i-i + ••• + bnz — 1 — {An 2 — 1 ) {cloBi -\~ < 2 iB{+i + ••• + flri 2 — 1 — iBn 2 — l), 

and then delete rows B n2 _ 1 through B 0 and the last n 2 columns, we obtain an ni X 
determinant for the resultant instead of the original ( n\ -\-n 2 ) X (ni -\-n 2 ) determinant. 
In some cases the resultant can be evaluated efficiently by means of this determinant; 
see CACM 12 (1969), 23-30, 302-303. 

J. T. Schwartz has shown that it is possible to evaluate resultants and Sturm 
sequences for polynomials of degree n with a total of 0 (n(logn) 2 ) arithmetic operations 
as n —► 00 . [JACM, to appear.] 

13. One can show by induction on j that the values of (uj- j-i(x), gj+i,hj) are replaced 
respectively by (l 1+p J w(x)uj(x), £ 2Jrp i g Jt hj ) for j > 2 , where pj = m-\-n 2 — 2 rij. 
[In spite of this growth, the bound (26) remains valid.] 

14. Let p be a prime of the domain, and let j, k be maximum such that p k \v n = 
t( v ), p 3 \ v n — 1 . Let P = p k . By Algorithm R we may write q(x) — ao + Pd\X -f- 
• • • -j- P s a s x s , where s = m — n > 2. Let us look at the coefficients of x n+1 , x n , 

and x n ~ 1 in v(x)q(x), namely PaiV n + P 2 a 2 v n —i H-, doV n -f Pa\V n —i -}-, and 

aov n —1 P Ui Vn _ 2 H-, each of which is a multiple of P 3 . We conclude from the first 

that p J \a 1 , from the second that p min ( fc > 2 J) \ a0j then from the third that P\a 0 . Hence 
P \ r(x). [If m were only n -f-1, the best we could prove would be that p^ k/2 ~\ divides 
r(x); e.g., consider u(x) = x 3 1 , v(x) — 4x 2 -(- 2x + 1, r(x) = 18. On the other hand, 
an argument based on determinants of matrices like ( 21 ) and ( 22 ) can be used to show 
that f(r) deg(lj) - deg( ^- 1 r(x) is always a multiple of ^)( d «g^)- d «g(^))(deg(v)-de g (r)-i) j 

15. Let $c_{ij} = a_{i1}a_{j1} + \cdots + a_{in}a_{jn}$; we may assume that $c_{ii} > 0$ for all i. If $c_{ij} \ne 0$ 
for some i ≠ j, we can replace row i by $(a_{i1} - ca_{j1}, \ldots, a_{in} - ca_{jn})$, where $c = c_{ij}/c_{jj}$; 
this does not change the value of the determinant, and it decreases the value of the 
upper bound we wish to prove, since $c_{ii}$ is replaced by $c_{ii} - c_{ij}^2/c_{jj}$. These replacements 
can be done in a systematic way for increasing i and for j < i, until $c_{ij} = 0$ for all 
i ≠ j. [The latter algorithm is called “Schmidt’s orthogonalization process”; see Math. 
Annalen 63 (1907), 442.] Then $\det(A)^2 = \det(AA^T) = c_{11} \ldots c_{nn}$. 

16. Let f(x 1 , . . . , X n ) 2 ) • • • ; X'Ti'jd'i ~ I - ‘ ’ ' -1 - Qo(Z 2 } • ■ • j %n) and g(X 2 i • • • ; %n) 

g m (x 2 ,..., i n ) 2 + • • • + <?o(z 2 , • • •, £n) 2 ; the latter is not identically zero. We have 

fC m(2N -J- 1) -j- (2 N -f-1 )&n, 

where counts the integer solutions of g{x 2 ,..., x n ) = 0 with variables bounded 
by N. Hence lim/v^oo a N /(2N -f l) n = lim^-co b N /(2N + l) n_1 , and this is zero by 
induction. 

17. (a) For convenience, let us describe the algorithm only for A = {a, b}. The hy¬ 
potheses imply that deg(<5iL r ) = deg^U) > 0, deg(Qi) < deg(<32). If deg(Qi) = 0, 
then Qi is just a nonzero rational number, so we set Q = Q 2 /Q 1 ■ Otherwise we let 
Qi = a Qu + bQ 12 + tt, Q 2 = aQ 2 1 -f- bQ 22 + r 2 , where n and r 2 are rational 
numbers; it follows that 

QiU - Q 2 V = a{Q u U - Q 21 V ) + b{Q 12 U - Q 22 V ) + r x U - r 2 V. 

We must have either deg(Qn) = deg(Qi) — 1 or deg(Qi 2 ) = deg(Qi) — 1. In the 
former case, deg (Q 11 U — Q 21 V) < deg(Qul/), by considering the terms of highest 
degree that start with a; so we may replace Q 1 by Qn, Q 2 by Q 2 1 , and repeat the 
process. Similarly in the latter case, we may replace (Q\, Q 2 ) by {Q 12 , Q 22 ) and repeat 
the process. 

(b) We may assume that deg((7) > deg(V). If deg(i?) > deg(U), note that 
Q\U — Q 2 V = QiR — (Q 2 — QiQ)V has degree less than deg(U) < deg(Qi/?), so 
we can repeat the process with U replaced by R; we obtain R = Q'V + R', U = 
(Q -j- Q')V -j- R ', where deg(i?') < deg(H), so eventually a solution will be obtained. 

(c) The algorithm of (b) gives Vi — UV 2 - } r R, deg(/2) < deg(y 2 ); by homogeneity, 
R — 0 and U is homogeneous. 

(d) We may assume that deg(U) < deg (U). If deg(U) = 0, set W <— U; otherwise 
use (c) to find U = QV, so that QVV = VQV, (QV — VQ)V — 0. This implies that 
QV = VQ, so we can set U *— V, V <— Q and repeat the process. 

For further details about the subject of this exercise, see P. M. Cohn, Proc. 
Cambridge Phil Soc. 57 (1961), 18-30. The considerably more difficult problem of 
characterizing all string polynomials such that UV = VU has been solved by G. M. 
Bergman [Ph.D. thesis, Harvard University, 1967]. 

18. [P. M. Cohn, Transactions Amer. Math. Soc. 109 (1963), 332-356.] 

Cl. Set ui «- Ui, u 2 *- U 2 , vi «- Vi, v 2 <- U 2 , zi <- z 2 <- w x <- w’ 2 <— 1, z[ <- z 2 4 - 
w[ w 2 <— 0, n 4 — 0. 

C2. (At this point the identities given in the exercise hold, and u x v x — u x v 2 \ v 2 = 0 if 
and only if u x — 0.) If v 2 = 0, the algorithm terminates with gcrd(Vi, V 2 ) = ^ 1 , 
lclm(Vi, V 2 ) = z[V 1 = —z 2 V 2 . (Also, by symmetry, we have gcld(Ui, U 2 ) — u 2 
and lcrm(Ui, U 2 ) = U x wi = — U 2 w 2 .) 

C3. Find Q and R such that $v_1 = Qv_2 + R$, where $\deg(R) < \deg(v_2)$. (We have 
$u_1(Qv_2 + R) = u_2v_2$, so $u_1R = (u_2 - u_1Q)v_2 = R'v_2$.) 

C4. Set (ti/i, v>2, w'i, w'2, zi, z 2 , z[, z 2 , u lf u 2 , Vi, v 2 ) «- (w[ — wiQ, w 2 — w 2 Q, wi, 
w 2 , z' lt z 2 , zi — Qz' 1} z 2 — Qz 2 , u 2 — uiQ, u\, V2, v\ — Qv 2 ) and n <- n-f- 1. Go 
back to C2. | 

This extension of Euclid’s algorithm includes most of the features we have seen in 
previous extensions, all at the same time, so it provides new insight into the special cases 
already considered. To prove that it is valid, note first that deg(v 2 ) decreases in step 
C4, so the algorithm certainly terminates. At the conclusion of the algorithm, iq is a 
common right divisor of Vi and V 2 , since w\V\ — (—l) n Vi and — w 2 v 1 = (—l) n V 2 ; also 
if d is any common right divisor of Vi and V 2 , it is a right divisor of z 1 V 1 -\-z 2 V 2 — v\. 
Hence vi = gcrd(Vi, V 2 ). Also if m is any common left multiple of Vi and V 2 , we may 
assume without loss of generality that m = t/iVi — U 2 V 2 , since the sequence of values 
of Q does not depend on lh and U 2 . Hence m = (—l) n (— u 2 z[)Vi = (— i) n {u 2 z' 2 )V 2 
is a multiple of z[Vi. 

In practice, if we just want to calculate $\mathrm{gcrd}(V_1, V_2)$, we may suppress the computation 
of n, $w_1$, $w_2$, $w_1'$, $w_2'$, $z_1$, $z_2$, $z_1'$, $z_2'$. These additional quantities were added 
to the algorithm primarily to make its validity more readily established. 

Note: Nontrivial factorizations of string polynomials, such as the example given 
with this exercise, can be found from matrix identities such as 

$$\begin{pmatrix}a&1\\1&0\end{pmatrix}\begin{pmatrix}b&1\\1&0\end{pmatrix}\begin{pmatrix}c&1\\1&0\end{pmatrix}\begin{pmatrix}0&1\\1&-c\end{pmatrix}\begin{pmatrix}0&1\\1&-b\end{pmatrix}\begin{pmatrix}0&1\\1&-a\end{pmatrix}=\begin{pmatrix}1&0\\0&1\end{pmatrix},$$

since these identities hold even when multiplication is not commutative. For example, 

$$(abc + a + c)(1 + ba) = (ab + 1)(cba + a + c).$$

(Compare this with the “continuant polynomials” of Section 4.5.3.) 

19. [Cf. Eugène Cahen, Théorie des Nombres 1 (Paris: A. Hermann & fils, 1914), 
336-338.] If such an algorithm exists, D is a gcrd by the argument in exercise 18. Let 
us regard A and B as a single 2n × n matrix C whose first n rows are those of A, and 
whose second n rows are those of B. Similarly, P and Q can be combined into a 2n × n 
matrix R; X and Y can be combined into an n × 2n matrix Z. The desired conditions 
now reduce to two equations C = RD, D = ZC. If we can find a 2n × 2n integer 
matrix U with determinant ±1 such that the last n rows of $U^{-1}C$ are all zero, then 
R = (first n columns of U), D = (first n rows of $U^{-1}C$), Z = (first n rows of $U^{-1}$) 
solves the desired conditions. Hence, for example, the following algorithm may be used 
(with m = 2n): 

Algorithm T (Triangularization). Let C be an m × n matrix of integers. This algorithm 
finds m × m integer matrices U and V such that UV = I and VC is “upper triangular.” 
(The entry in row i and column j of VC is zero if i > j.) 

T1. [Initialize.] Set U ← V ← I, the m × m identity matrix; and set T ← C. 
(Throughout the algorithm we will have T = VC and UV = I.) 

T2. [Iterate on j.] Do step T3 for j = 1, 2, ..., min(m, n), and terminate the algorithm. 

T3. [Zero out column j.] Perform the following transformation zero or more times until 
$T_{ij}$ is zero for all i > j: Let $T_{kj}$ be a nonzero element of $\{T_{jj}, T_{(j+1)j}, \ldots, T_{mj}\}$ 
having the smallest absolute value. Interchange rows k and j of T and of V; 
interchange columns k and j of U. Then subtract $\lfloor T_{ij}/T_{jj}\rfloor$ times row j from row 
i, in matrices T and V, and add the same multiple of column i to column j in 
matrix U, for j < i ≤ m. | 
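
For the stated example, the algorithm yields 
$\begin{pmatrix}1&2\\3&4\end{pmatrix} = \begin{pmatrix}1&0\\3&2\end{pmatrix}\begin{pmatrix}1&2\\0&-1\end{pmatrix}$, 
$\begin{pmatrix}4&3\\2&1\end{pmatrix} = \begin{pmatrix}4&5\\2&3\end{pmatrix}\begin{pmatrix}1&2\\0&-1\end{pmatrix}$, 
$\begin{pmatrix}1&2\\0&-1\end{pmatrix} = \begin{pmatrix}1&0\\2&-2\end{pmatrix}\begin{pmatrix}1&2\\3&4\end{pmatrix} + \begin{pmatrix}0&0\\1&0\end{pmatrix}\begin{pmatrix}4&3\\2&1\end{pmatrix}$. 
(Actually any matrix with determinant ±1 would 
be a gcrd in this particular case.) 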
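
Algorithm T is short enough to transcribe almost step for step. The Python sketch below is one possible reading of steps T1 to T3 (0-based indices, lists of lists, hypothetical function names); Python's floor division plays the role of ⌊T_ij/T_jj⌋, which keeps every remainder strictly smaller in absolute value than the pivot, so the inner loop terminates.

```python
def triangularize(C):
    """Return (U, V, T) with U*V = I and T = V*C upper triangular (integers only)."""
    m, n = len(C), len(C[0])
    T = [row[:] for row in C]                                   # T1: T <- C
    U = [[int(i == j) for j in range(m)] for i in range(m)]     #     U <- I
    V = [[int(i == j) for j in range(m)] for i in range(m)]     #     V <- I
    for j in range(min(m, n)):                                  # T2
        while any(T[i][j] != 0 for i in range(j + 1, m)):       # T3
            k = min((i for i in range(j, m) if T[i][j] != 0),
                    key=lambda i: abs(T[i][j]))
            T[j], T[k] = T[k], T[j]              # interchange rows k and j of T, V
            V[j], V[k] = V[k], V[j]
            for row in U:                        # interchange columns k and j of U
                row[j], row[k] = row[k], row[j]
            for i in range(j + 1, m):
                q = T[i][j] // T[j][j]
                if q:
                    T[i] = [a - q * b for a, b in zip(T[i], T[j])]
                    V[i] = [a - q * b for a, b in zip(V[i], V[j])]
                    for row in U:                # column j += q * column i
                        row[j] += q * row[i]
    return U, V, T


def matmul(A, B):
    """Plain integer matrix product, used to check the invariants."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]
```

Applied to the 4 × 2 matrix formed by stacking (1 2; 3 4) on (4 3; 2 1), the first two rows of T come out as (1 2; 0 −1), the gcrd D of the example, and the first two columns of U give R with C = RD.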

20. It may be helpful to consider the construction of exercise 4.6.2-22, with p m replaced 
by a small number e. 

21. Note that Algorithm R is used only when m − n ≤ 1; furthermore, the coefficients 
are bounded by (25) with m = n. [The stated formula is, in fact, the execution time 
observed in practice, not merely an upper bound. For more detailed information see 
G. E. Collins, Proc. 1968 Summer Inst. on Symbolic Math. Comp., Robert G. Tobey, 
ed. (IBM Federal Systems Center: June 1969), 195-231.] 

22. A sequence of signs cannot contain two consecutive zeros, since $u_{k+1}(x)$ is a 
nonzero constant in (29). Moreover we cannot have “+, 0, +” or “−, 0, −” as 
subsequences. The formula V(u, a) − V(u, b) is clearly valid when b = a, so we must 
only verify it as b increases. The polynomials $u_j(x)$ have finitely many roots, and V(u, b) 
changes only when b encounters or passes such roots. Let x be a root of some (possibly 
several) $u_j$. When b increases from x − ε to x, the sign sequence near j goes from “+, 
±, −” to “+, 0, −” or from “−, ±, +” to “−, 0, +” if j > 0; and from “+, −” 
to “0, −” or from “−, +” to “0, +” if j = 0. (Since u'(x) is the derivative, u'(x) is 
negative when u(x) is decreasing.) Thus the net change in V is $-\delta_{j0}$. When b increases 
from x to x + ε, a similar argument shows that V remains unchanged. 

[L. E. Heindel, JACM 18 (1971), 533-548, has applied these ideas to construct 
algorithms for isolating the real zeros of a given polynomial u(x), in time bounded by 
a polynomial in deg(u) and log N, where all coefficients $u_j$ are integers with $|u_j| \le N$, 
and all operations are guaranteed to be exact.] 


23. If v has n − 1 real roots occurring between the n real roots of u, then (by considering 
sign changes) u(x) mod v(x) has n − 2 real roots lying between the n − 1 roots of v. 


24. First show that hj = " 2 ( 1— B ^- Then show that 

the exponent of #2 on the left-hand side of (18) has the form 82 4 61 x, where x = 
82 4 ■ ■ ■ —[- 8 j —1 -(-1 — 62(63 -(-•*■ 4 6 j —1 + 1 ) — 63(1 — 6 2 )( 6 4 4 * • * 4 6 j_i 4 1 ) — 
• • • — 8 j—\{\ — 62 )... (1 — 6 j_ 2 )( 1 ). But x = 1, since it is seen to be independent of 
6 j —1 and we can set 6 j—\ = 0 , etc. A similar derivation works for <73, g 4 , ..., and a 
simpler derivation works for (23). 


25. Each coefficient of Uj(x) can be expressed as a determinant in which one column 
contains only £(u), £(v), and zeros. To use this fact, modify Algorithm C as follows: 
In step Cl, set g <— gcd(^(u), i(v)) and h <— 0. In step C3, if = 0, set u(x) «— u(x), 
v(x) ■*— r{x)/g, h <— £(u) 6 /g, g +— t(u), and return to C 2 ; otherwise proceed as in the 
unmodified algorithm. The effect of this new initialization is simply to replace Uj(x) 
by Uj(x)/gcd(£(u), £(v)) for all j > 3; thus, will become in (28). 


26. In fact, even more is true. Note that the algorithm in exercise 3 computes 
$\pm p_n(x)$ and $\mp q_n(x)$ for n ≥ −1. Let $e_n = \deg(q_n)$ and $d_n = \deg(p_nu - q_nv)$; we 
observed in exercise 3 that $d_{n-1} + e_n = \deg(u)$ for n ≥ 0. We shall prove that the 
conditions $\deg(q) < e_n$ and $\deg(pu - qv) < d_{n-2}$ imply that $p(x) = c(x)p_{n-1}(x)$ and 
$q(x) = c(x)q_{n-1}(x)$: Given such p and q, we can find c(x) and d(x) such that $p(x) = 
c(x)p_{n-1}(x) + d(x)p_n(x)$ and $q(x) = c(x)q_{n-1}(x) + d(x)q_n(x)$, since $p_{n-1}(x)q_n(x) - 
p_n(x)q_{n-1}(x) = \pm1$. Hence $pu - qv = c(p_{n-1}u - q_{n-1}v) + d(p_nu - q_nv)$. If d(x) ≠ 0, 
we must have $\deg(c) + e_{n-1} = \deg(d) + e_n$, since $\deg(q) < \deg(q_n)$; it follows that 
$\deg(c) + d_{n-1} > \deg(d) + d_n$, since this is surely true if $d_n = -\infty$ and otherwise we 
have $d_{n-1} + e_n = d_n + e_{n+1} > d_n + e_{n-1}$. Therefore $\deg(pu - qv) = \deg(c) + d_{n-1}$; 
but we have assumed that $\deg(pu - qv) < d_{n-2} = d_{n-1} + e_n - e_{n-1}$, so $\deg(c) < 
e_n - e_{n-1}$ and deg(d) < 0, a contradiction. 

[This result is essentially due to L. Kronecker, Monatsberichte Königl. Preuß.
Akad. Wiss. (Berlin: 1881), 535-600. It implies the following theorem: “Let u(x)
and v(x) be relatively prime polynomials over a field and let d < deg(v) < deg(u). If
q(x) is a polynomial of least degree such that there exist polynomials p(x) and r(x)
with p(x)u(x) − q(x)v(x) = r(x) and deg(r) = d, then $p(x)/q(x) = p_n(x)/q_n(x)$
for some n.” For if $d_{n-2} > d \ge d_{n-1}$, there are solutions q(x) with
$\deg(q) = e_{n-1} + d - d_{n-1} < e_n$, and we have proved that all solutions of such low degree have
the stated property.]


SECTION 4.6.2 

1. By the principle of inclusion and exclusion (Section 1.3.3), the number of
polynomials without linear factors is $\sum_{k \le n} \binom{p}{k} p^{n-k} (-1)^k = p^{n-p}(p-1)^p$. The stated
probability when $n \ge p$ is therefore $1 - (1 - 1/p)^p$, which is greater than $\frac12$. [In fact,
the stated probability is greater than $\frac12$ for all $n \ge 1$.] The average number of linear
factors is p times the average number of times x is a factor, so it is
$\frac{p}{p-1}(1 - p^{-n}) = 1 + p^{-1} + \cdots + p^{1-n}$. [See answer 38 for further comments on the average number
of linear factors.]
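
The count in answer 1 can be checked by brute force for small cases. The following is only a sketch: the function names are illustrative, and it assumes the polynomials being counted are monic of degree n with coefficients modulo a prime p.

```python
from itertools import product

def count_no_linear_factor(p, n):
    """Count monic degree-n polynomials mod p that have no root in {0, ..., p-1}."""
    count = 0
    for low in product(range(p), repeat=n):       # coefficients of x^0 .. x^(n-1)
        coeffs = list(low) + [1]                  # monic leading coefficient
        has_root = any(
            sum(c * pow(s, i, p) for i, c in enumerate(coeffs)) % p == 0
            for s in range(p)
        )
        if not has_root:
            count += 1
    return count

def closed_form(p, n):
    """The value p^(n-p) (p-1)^p from the answer; assumed valid only for n >= p."""
    return p ** (n - p) * (p - 1) ** p
```

For instance, count_no_linear_factor(3, 3) can be compared with closed_form(3, 3); a monic cubic without roots is irreducible, so both count the irreducible cubics modulo 3.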

2. (a) We know that u(x) has a representation as a product of irreducible polynomials;
and the leading coefficients of these polynomials must be units, since they divide the
leading coefficient of u(x). Therefore we may assume that u(x) has a representation as
a product of monic irreducible polynomials $p_1(x)^{e_1} \ldots p_r(x)^{e_r}$, where $p_1(x)$, ..., $p_r(x)$
are distinct. This representation is unique, except for the order of the factors, so the
conditions on u(x), v(x), w(x) are satisfied if and only if

$$v(x) = p_1(x)^{\lfloor e_1/2 \rfloor} \ldots p_r(x)^{\lfloor e_r/2 \rfloor}, \qquad w(x) = p_1(x)^{e_1 \bmod 2} \ldots p_r(x)^{e_r \bmod 2}.$$

(b) The generating function for the number of monic polynomials of degree n is
$1 + pz + p^2z^2 + \cdots = 1/(1 - pz)$. The generating function for the number of polynomials
of degree n having the form $v(x)^2$, where v(x) is monic, is $1 + pz^2 + p^2z^4 + \cdots = 1/(1 - pz^2)$.
If the generating function for the number of monic squarefree polynomials
of degree n is g(z), then by part (a) we must have $1/(1 - pz) = g(z)/(1 - pz^2)$. Hence
$g(z) = (1 - pz^2)/(1 - pz) = 1 + pz + (p^2 - p)z^2 + (p^3 - p^2)z^3 + \cdots$. The answer is
$p^n - p^{n-1}$ for $n \ge 2$. [Curiously, this proves that gcd(u(x), u′(x)) = 1 with probability
1 − 1/p; it is the same as the probability that gcd(u(x), v(x)) = 1 when u(x) and v(x)
are independent, by exercise 4.6.1-5.]

Note: By a similar argument, every u(x) has a unique representation $v(x)w(x)^r$,
where v(x) is not divisible by the rth power of any irreducible; the number of such
monic polynomials v(x) is $p^n - p^{n-r+1}$ for $n \ge r$.
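
A small experiment can confirm the count $p^n - p^{n-1}$ of part (b). This is a sketch under stated assumptions: polynomials are tuples of coefficients with the constant term first, the helper names are made up for the illustration, and non-squarefree polynomials are found by sieving out every product $v(x)^2 w(x)$ with deg(v) ≥ 1 rather than by computing gcd(u, u′).

```python
from itertools import product

def monic_polys(p, deg):
    """All monic polynomials of the given degree mod p (constant term first)."""
    for low in product(range(p), repeat=deg):
        yield low + (1,)

def mul(a, b, p):
    """Product of two coefficient tuples, reduced mod p."""
    c = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            c[i + j] = (c[i + j] + x * y) % p
    return tuple(c)

def count_squarefree(p, n):
    """Count monic squarefree polynomials of degree n mod p."""
    non_squarefree = set()
    for dv in range(1, n // 2 + 1):
        for v in monic_polys(p, dv):
            v2 = mul(v, v, p)
            for w in monic_polys(p, n - 2 * dv):
                non_squarefree.add(mul(v2, w, p))
    return p ** n - len(non_squarefree)
```

The sieve is exponential in n, so it is only meant for tiny cases such as p = 2 or 3 and n up to about 5.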

3. Let $u(x) = u_1(x) \ldots u_r(x)$. There is at most one such v(x), by the argument of
Theorem 4.3.2C. There is at least one if, for each j, we can solve the system with
$w_j(x) = 1$ and $w_k(x) = 0$ for $k \ne j$. A solution to the latter is $v_1(x) \prod_{k \ne j} u_k(x)$,
where $v_1(x)$ and $v_2(x)$ can be found satisfying

$$v_1(x) \prod_{k \ne j} u_k(x) + v_2(x)u_j(x) = 1, \qquad \deg(v_1) < \deg(u_j),$$

by the extension of Euclid’s algorithm (exercise 4.6.1-3).

4. By unique factorization, we have $(1 - pz)^{-1} = \prod_{n \ge 1} (1 - z^n)^{-a_{np}}$; after taking
logarithms, this can be rewritten

$$\sum_{j \ge 1} G_p(z^j)/j = \sum_{k \ge 1} p^k z^k/k = \ln(1/(1 - pz)).$$

The stated identity now yields the answer $G_p(z) = \sum_{m \ge 1} \mu(m) m^{-1} \ln(1/(1 - pz^m))$,
from which we obtain $a_{np} = \sum_{d \backslash n} \mu(n/d) n^{-1} p^d$; thus $\lim_{p \to \infty} a_{np}/p^n = 1/n$. To
prove the stated identity, note that

$$\sum_{n,j \ge 1} \mu(n) g(z^{nj}) n^{-t} j^{-t} = \sum_{m \ge 1} g(z^m) m^{-t} \sum_{n \backslash m} \mu(n) = g(z).$$
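
The formula for $a_{np}$ is easy to evaluate directly. The sketch below uses illustrative names (mobius, irreducible_count) and checks the identity $\sum_{d \backslash n} d\,a_{dp} = p^n$, which is the coefficient form of the product in the first line of the answer.

```python
def mobius(n):
    """Moebius function mu(n), computed by trial division."""
    result, d = 1, 2
    while d * d <= n:
        if n % d == 0:
            n //= d
            if n % d == 0:        # a squared prime factor gives mu = 0
                return 0
            result = -result
        d += 1
    return -result if n > 1 else result

def irreducible_count(n, p):
    """a_np = (1/n) * sum over divisors d of n of mu(n/d) * p^d."""
    total = sum(mobius(n // d) * p ** d for d in range(1, n + 1) if n % d == 0)
    return total // n
```

For p = 2 the first few values are 2, 1, 2, 3, 6, 9, the familiar counts of irreducible binary polynomials of degrees 1 through 6.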

5. Let $a_{npr}$ be the number of monic polynomials of degree n modulo p having exactly
r irreducible factors. Then $\mathcal{G}_p(z,w) = \sum_{n,r \ge 0} a_{npr} z^n w^r = \exp(\sum_{k \ge 1} G_p(z^k) w^k/k)$,
where $G_p$ is the generating function in exercise 4; cf. Eq. 1.2.9-34. We have

$$\sum_{n \ge 0} A_{np} z^n = \partial\mathcal{G}_p(z/p, w)/\partial w \,\big|_{w=1} = \Bigl(\sum_{k \ge 1} G_p(z^k/p^k)\Bigr) \exp(\ln(1/(1 - z)))$$

$$= \Bigl(\sum_{n \ge 1} \ln(1/(1 - p^{1-n} z^n))\, \varphi(n)/n\Bigr) \Big/ (1 - z);$$

hence $A_{np} = H_n + 1/(2p) + O(p^{-2})$ for $n \ge 2$. The average value of $2^r$ is the coefficient
of $z^n$ in $\mathcal{G}_p(z/p, 2)$, namely $n + 1 + (n - 1)/p + O(p^{-2})$. (The variance is of order $n^3$,
however: set w = 4.)

6. For $0 \le s < p$, x − s is a factor of $x^p - x$ (modulo p) by Fermat’s theorem. So
$x^p - x$ is a multiple of lcm(x − 0, x − 1, ..., x − (p − 1)) = $x^{\underline{p}}$. [Note: Therefore the
Stirling numbers $\left[{p \atop k}\right]$ are multiples of p except when k = 1, k = p. Equation 1.2.6-41
shows that the same statement is valid for Stirling numbers $\left\{{p \atop k}\right\}$ of the other kind.]
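
Since both sides are monic of degree p, answer 6 amounts to the congruence $x(x-1)\ldots(x-(p-1)) \equiv x^p - x$ (modulo p). A minimal sketch of a direct check follows; the function name is hypothetical and coefficients are stored constant term first.

```python
def falling_factorial_mod(p):
    """Coefficients of x(x-1)...(x-(p-1)) reduced modulo p, constant term first."""
    poly = [1]
    for s in range(p):
        nxt = [0] * (len(poly) + 1)
        for i, c in enumerate(poly):
            nxt[i + 1] = (nxt[i + 1] + c) % p     # contribution of c * x
            nxt[i] = (nxt[i] - s * c) % p         # contribution of c * (-s)
        poly = nxt
    return poly
```

Apart from the leading 1, the entries are the signed Stirling numbers of the first kind reduced modulo p, so the result also illustrates the bracketed note.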

7. The factors on the right are relatively prime, and each is a divisor of u(x), so their
product divides u(x). On the other hand, u(x) divides

$$v(x)^p - v(x) = \prod_{0 \le s < p} (v(x) - s),$$

so it divides the right-hand side by exercise 4.5.2-2.

8. The vector (18) is the only output whose kth component is nonzero. 

9. For example, start with $x \leftarrow 1$ and $y \leftarrow 1$; then repeatedly set $R[x] \leftarrow y$,
$x \leftarrow 2x \bmod 101$, $y \leftarrow 51y \bmod 101$, one hundred times.
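
The loop of answer 9 translates almost literally into code. In this sketch the table name R follows the answer, while the function name and its parameters are assumptions; it relies on 2 being a primitive root modulo 101 and on 51 being the inverse of 2, and it uses Python 3.8+ for the modular inverse.

```python
def reciprocal_table(p=101, g=2):
    """Return R with x * R[x] = 1 (mod p) for 0 < x < p, given a primitive root g."""
    g_inv = pow(g, -1, p)          # 51 when p = 101 and g = 2
    R = [0] * p
    x = y = 1
    for _ in range(p - 1):         # x runs through every nonzero residue once
        R[x] = y
        x = x * g % p
        y = y * g_inv % p
    return R
```

The invariant xy ≡ 1 (modulo p) holds at every step, so each table entry is a reciprocal as soon as it is stored.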

10. The matrix Q − I below has a null space generated by the two vectors $v^{[1]}$ =
(1, 0, 0, 0, 0, 0, 0, 0), $v^{[2]}$ = (0, 1, 1, 0, 0, 1, 1, 1). The factorization is

$$(x^6 + x^5 + x^4 + x + 1)(x^2 + x + 1).$$

11. Removing the trivial factor x, the matrix Q − I on the previous page has a null
space generated by (1, 0, 0, 0, 0, 0, 0) and (0, 3, 1, 4, 1, 2, 1). The factorization is

$$x(x^2 + 3x + 4)(x^5 + 2x^4 + x^3 + 4x^2 + x + 3).$$

12. If p = 2, $(x + 1)^4 = x^4 + 1$. If p = 8k + 1, Q − I is the zero matrix, so there
are four factors. For other values of p we have

$$p = 8k+3:\ \begin{pmatrix} 0&0&0&0\\ 0&-1&0&1\\ 0&0&-2&0\\ 0&1&0&-1 \end{pmatrix}; \qquad
p = 8k+5:\ \begin{pmatrix} 0&0&0&0\\ 0&-2&0&0\\ 0&0&0&0\\ 0&0&0&-2 \end{pmatrix}; \qquad
p = 8k+7:\ \begin{pmatrix} 0&0&0&0\\ 0&-1&0&-1\\ 0&0&-2&0\\ 0&-1&0&-1 \end{pmatrix}.$$

Here Q − I has rank 2, so there are 4 − 2 = 2 factors. [But it is easy to prove that $x^4 + 1$
is irreducible over the integers, since it has no linear factors and the coefficient of x in
any factor of degree two must be less than or equal to 2 in absolute value by exercise 20.
For all $k \ge 2$, H. P. F. Swinnerton-Dyer has exhibited polynomials of degree $2^k$ that are
irreducible over the integers, but they split completely into linear and quadratic factors
modulo every prime. For degree 8, his example is $x^8 - 16x^6 + 88x^4 + 192x^2 + 144$,
having roots $\pm\sqrt2 \pm \sqrt3 \pm i$ [see Math. Comp. 24 (1970), 733-734]. According to the
theorem of Frobenius cited in exercise 37, any irreducible polynomial of degree n whose
Galois group contains no n-cycles will have factors modulo almost all primes.]
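
The three matrices and their ranks can be recomputed mechanically. The sketch below is an illustration rather than Berlekamp's full procedure: it builds Q − I for $x^4 + 1$ using $x^4 \equiv -1$ and $x^8 \equiv 1$, then finds the rank modulo an odd prime p by Gaussian elimination. The function names are made up, and Python 3.8+ is assumed for the modular inverse.

```python
def q_minus_i(p):
    """Row k holds the coefficients of x^(pk) mod (x^4 + 1), minus the identity row."""
    rows = []
    for k in range(4):
        e = (p * k) % 8                      # x^8 = 1 modulo x^4 + 1
        row = [0, 0, 0, 0]
        row[e % 4] = 1 if e < 4 else -1      # x^4 = -1 modulo x^4 + 1
        row[k] -= 1
        rows.append(row)
    return rows

def rank_mod(rows, p):
    """Rank of a 4 x 4 integer matrix over the field of p elements (p an odd prime)."""
    m = [[c % p for c in r] for r in rows]
    rank = 0
    for col in range(4):
        pivot = next((i for i in range(rank, 4) if m[i][col]), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        inv = pow(m[rank][col], -1, p)
        m[rank] = [c * inv % p for c in m[rank]]
        for i in range(4):
            if i != rank and m[i][col]:
                f = m[i][col]
                m[i] = [(a - f * b) % p for a, b in zip(m[i], m[rank])]
        rank += 1
    return rank
```

The number of irreducible factors of $x^4 + 1$ modulo p is then 4 − rank_mod(q_minus_i(p), p), which comes out as 4 when p mod 8 = 1 and as 2 otherwise.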

13. p = 8k + 1: $(x + (1 + \sqrt{-1})/\sqrt2)(x + (1 - \sqrt{-1})/\sqrt2)(x - (1 + \sqrt{-1})/\sqrt2)(x - (1 - \sqrt{-1})/\sqrt2)$.
p = 8k + 3: $(x^2 - \sqrt{-2}\,x - 1)(x^2 + \sqrt{-2}\,x - 1)$. p = 8k + 5:
$(x^2 - \sqrt{-1})(x^2 + \sqrt{-1})$. p = 8k + 7: $(x^2 - \sqrt2\,x + 1)(x^2 + \sqrt2\,x + 1)$. The latter
factorization also holds over the field of real numbers.

14. Algorithm N can be adapted to find the coefficients of w: Let A be the (r + 1) × n
matrix whose kth row contains the coefficients of $v(x)^k \bmod u(x)$, for $0 \le k \le r$. Apply
the method of Algorithm N until the first dependence is found in step N3; then the
algorithm terminates with $w(x) = v_0 + v_1 x + \cdots + v_k x^k$, where $v_j$ is defined in (18).
At this point $2 \le k \le r$; it is not necessary to know r in advance, since we can check
for dependency after generating each row of A.

15. We may assume that $u \ne 0$ and that p is odd. Berlekamp’s method applied to the
polynomial $x^2 - u$ tells us that a square root exists if and only if $u^{(p-1)/2} \bmod p = 1$;
let us assume that this condition holds.

Let $p - 1 = 2^e \cdot q$, where q is odd. Zassenhaus’s factoring procedure suggests the
following square-root extraction algorithm: Set $t \leftarrow 0$. Evaluate

$$\gcd((x + t)^q - 1,\ x^2 - u), \quad \gcd((x + t)^{2q} - 1,\ x^2 - u),$$
$$\gcd((x + t)^{4q} - 1,\ x^2 - u), \quad \gcd((x + t)^{8q} - 1,\ x^2 - u), \quad \ldots,$$

until finding the first case where the gcd is not 1 (modulo p). If the gcd is x − v, then
$\sqrt{u} = \pm v$. If the gcd is $x^2 - u$, set $t \leftarrow t + 1$ and repeat the calculation.

Notes: If $(x + t)^k \bmod (x^2 - u) = ax + b$, then we have $(x + t)^{k+1} \bmod (x^2 - u) =
(b + at)x + (bt + au)$, and $(x + t)^{2k} \bmod (x^2 - u) = 2abx + (b^2 + a^2u)$; hence $(x + t)^q$,
$(x + t)^{2q}$, ... are easy to evaluate efficiently, and the calculation for fixed t takes
$O((\log p)^3)$ units of time. The square root will be found when t = 0 with probability
$1/2^{e-1}$; thus it will always be found immediately when p mod 4 = 3. If we choose t
at random instead of increasing it sequentially, exercise 29 gives a rigorous proof that
each t gives success at least about half of the time; but for practical purposes this
random choice isn’t needed.

Another square-root method has been suggested by D. Shanks. When e > 1
it requires an auxiliary constant z (depending only on p) such that $z^{2^{e-1}} \equiv -1$
(modulo p). The value $z = n^q \bmod p$ will work for almost one half of all integers n;
once z is known, the following algorithm requires no more probabilistic calculation:

S1. Set $y \leftarrow z$, $r \leftarrow e$, $v \leftarrow u^{(q+1)/2} \bmod p$, $w \leftarrow u^q \bmod p$.

S2. If w = 1, stop; v is the answer. Otherwise find the smallest k such that $w^{2^k} \bmod p$
is equal to 1. If k = r, stop (there is no answer); otherwise set

$$(y, r, v, w) \leftarrow \bigl(y^{2^{r-k}},\ k,\ v y^{2^{r-k-1}},\ w y^{2^{r-k}}\bigr)$$

and repeat step S2. |

The validity of this algorithm follows from the invariant congruences $uw \equiv v^2$,
$y^{2^{r-1}} \equiv -1$, $w^{2^{r-1}} \equiv 1$ (modulo p). On the average, step S2 will require about $\frac14 e^2$
multiplications mod p. Reference: Proc. Second Manitoba Conf. Numer. Math. (1972),
58-62. A related method was published by A. Tonelli, Göttinger Nachrichten (1891),
344-346.
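
Steps S1 and S2 can be sketched as follows. This is an illustration under stated assumptions, not a tuned implementation: the function name is hypothetical, p must be an odd prime, and the constant z is obtained by trying n = 2, 3, ... until a quadratic nonresidue is found, instead of drawing n at random.

```python
def shanks_sqrt(u, p):
    """Return v with v*v = u (mod p) following steps S1-S2, or None if u is a nonresidue."""
    u %= p
    if u == 0:
        return 0
    q, e = p - 1, 0
    while q % 2 == 0:                         # write p - 1 = 2^e * q with q odd
        q //= 2
        e += 1
    n = 2
    while pow(n, (p - 1) // 2, p) != p - 1:   # search for a nonresidue n
        n += 1
    z = pow(n, q, p)                          # z^(2^(e-1)) = -1 (mod p)
    # S1
    y, r = z, e
    v = pow(u, (q + 1) // 2, p)
    w = pow(u, q, p)
    # S2
    while w != 1:
        k, t = 0, w
        while t != 1:                         # smallest k with w^(2^k) = 1
            t = t * t % p
            k += 1
        if k == r:
            return None                       # no square root exists
        b = pow(y, 1 << (r - k - 1), p)       # b = y^(2^(r-k-1))
        y = b * b % p                         # y^(2^(r-k))
        r = k
        v = v * b % p
        w = w * y % p
    return v
```

For example, with p = 13 and u = 10 the loop runs once and returns 7, and indeed $7^2 = 49 \equiv 10$ (modulo 13).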

16. (a) Substitute polynomials modulo p for integers, in the proof for n = 1. (b) The
proof for n = 1 carries over to any finite field. (c) Since $x = \xi^k$ for some k, $x^{p^n} = x$
in the field defined by f(x). Furthermore, the elements y that satisfy the equation
$y^{p^m} = y$ in the field are closed under addition, and closed under multiplication; so if
$x^{p^m} = x$, then $\xi$ (being a polynomial in x with integer coefficients) satisfies $\xi^{p^m} = \xi$.

17. If $\xi$ is a primitive root, each nonzero element is some power of $\xi$. Hence the order
must be a divisor of $13^2 - 1 = 2^3 \cdot 3 \cdot 7$, and $\varphi(f)$ elements have order f.

      f   φ(f)       f   φ(f)       f   φ(f)       f   φ(f)
      1     1        3     2        7     6       21    12
      2     1        6     2       14     6       42    12
      4     2       12     4       28    12       84    24
      8     4       24     8       56    24      168    48

18. (a) $\mathrm{pp}(p_1(u_n x)) \ldots \mathrm{pp}(p_r(u_n x))$, by Gauss’s lemma. For example, let

$$u(x) = 6x^3 - 3x^2 + 2x - 1, \qquad v(x) = x^3 - 3x^2 + 12x - 36 = (x^2 + 12)(x - 3);$$

then $\mathrm{pp}(36x^2 + 12) = 3x^2 + 1$, $\mathrm{pp}(6x - 3) = 2x - 1$. (This is a modern version of a
fourteenth-century trick used for many years to help solve algebraic equations.)

(b) Let $\mathrm{pp}(w(u_n x)) = \bar{w}_m x^m + \cdots + \bar{w}_0 = w(u_n x)/c$, where c is the content of
$w(u_n x)$ as a polynomial in x. Then $w(x) = (c\bar{w}_m/u_n^m)x^m + \cdots + c\bar{w}_0$, hence $c\bar{w}_m = u_n^m$;
since $\bar{w}_m$ is a divisor of $u_n$, c is a multiple of $u_n^{m-1}$.

19. If u(x) = v(x)w(x) with $\deg(v)\deg(w) \ge 1$, then $u_n x^n \equiv v(x)w(x)$ (modulo p). By
unique factorization modulo p, all but the leading coefficients of v and w are multiples
of p, and $p^2$ divides $v_0 w_0 = u_0$.

20. (a) $\sum_j (\alpha u_j - u_{j-1})(\bar\alpha \bar{u}_j - \bar{u}_{j-1}) = \sum_j (u_j - \bar\alpha u_{j-1})(\bar{u}_j - \alpha \bar{u}_{j-1})$. (b) We may
assume that $u_0 \ne 0$. Let $m(u) = \prod_{1 \le j \le n} \min(1, |\alpha_j|) = |u_0|/(M(u)|u_n|)$. Whenever
$|\alpha_j| < 1$, change the factor $x - \alpha_j$ to $\bar\alpha_j x - 1$ in u(x); this doesn’t affect |u|. Now
looking at the leading and trailing coefficients only, we have
$|u|^2 \ge |u_n|^2 m(u)^2 + |u_n|^2 M(u)^2$; hence we obtain the slightly stronger result
$M(u)^2 \le (|u|^2 + (|u|^4 - 4|u_0 u_n|^2)^{1/2})/2|u_n|^2$. [A further improvement in the estimate of M(u) can often
be obtained by replacing u(x) by the polynomial $\hat{u}(x) = (x^k - s_k/|u|^2)u(x)$, where
$s_k = \sum_j u_j \bar{u}_{j-k}$; since $M(\hat{u}) = M(u)$ and $|\hat{u}|^2 = |u|^2 - |s_k|^2/|u|^2$, we obtain the
inequality $|u|^2 - |s_k|^2/|u|^2 \ge |u_0|^2|s_k|^2/(|u|^4 M(u)^2) + |u_n|^2 M(u)^2$ for $1 \le k \le n$.
In the case of polynomial (22), we have $s_2 = -72$ and we obtain the bound M(u) <
8.1837.] (c) $u_j = \pm u_m \sum \alpha_{i_1} \ldots \alpha_{i_{m-j}}$ is an elementary symmetric function, hence
$|u_j| \le |u_m| \sum \beta_{i_1} \ldots \beta_{i_{m-j}}$ where $\beta_i = \max(1, |\alpha_i|)$. We complete the proof by showing that
when $x_1 \ge 1$, ..., $x_n \ge 1$, and $x_1 \ldots x_n = M$, the elementary symmetric function
$\sigma_{nk} = \sum x_{i_1} \ldots x_{i_k}$ is $\le \binom{n-1}{k-1}M + \binom{n-1}{k}$, the value assumed when
$x_1 = \cdots = x_{n-1} = 1$ and $x_n = M$. (For if $x_1 \le \cdots \le x_n < M$, the transformation
$x_n \leftarrow x_{n-1}x_n$, $x_{n-1} \leftarrow 1$ increases $\sigma_{nk}$ by $\sigma_{(n-2)(k-1)}(x_n - 1)(x_{n-1} - 1)$, which is positive.)
(d) $|v_j| \le |v_m|\bigl(\binom{m-1}{j}M(v) + \binom{m-1}{j-1}\bigr) \le |u_n|\bigl(\binom{m-1}{j}M(u) + \binom{m-1}{j-1}\bigr)$, since $|v_m| \le |u_n|$
and $M(v) \le M(u)$. [See J. Vicente Gonçalves, Revista de Faculdade de Ciências (2)
A 1 (Univ. Lisbon, 1950), 167-171; M. Mignotte, Math. Comp. 28 (1974), 1153-1157;
Mignotte and Payafar, RAIRO Analyse numér. 13 (1979), 181-192.]

21. (a) $\int_0^1 (u_n e(n\theta) + \cdots + u_0)(\bar{u}_n e(-n\theta) + \cdots + \bar{u}_0)\, d\theta = |u_n|^2 + \cdots + |u_0|^2$, since
$\int_0^1 e(j\theta)e(-k\theta)\, d\theta = \delta_{jk}$; now use induction on t. (b) Since $|v_j| \le \binom{m}{j} M(v)|v_m|$ we
conclude that $|v|^2 \le \binom{2m}{m} M(v)^2 |v_m|^2$. Hence
$|v|^2|w|^2 \le \binom{2m}{m}\binom{2k}{k} M(v)^2 M(w)^2 |v_m w_k|^2 = f(m, k) M(u)^2 |u_n|^2 \le f(m, k)|u|^2$.
[Slightly better values for f(m, k) are possible based
on the more detailed information in exercise 20.] (c) The case t = 3 suffices to show
how to get from t − 1 to t. When t = 2 we have shown that, for all $\theta_1$,

$$\int_0^1\!\!\int_0^1\!\!\int_0^1\!\!\int_0^1 |v(e(\theta_1), e(\phi_2), e(\phi_3))|^2\, |w(e(\theta_1), e(\psi_2), e(\psi_3))|^2\, d\phi_2\, d\phi_3\, d\psi_2\, d\psi_3$$
$$\le f(m_2, k_2) f(m_3, k_3) \int_0^1\!\!\int_0^1 |v(e(\theta_1), e(\theta_2), e(\theta_3))|^2\, |w(e(\theta_1), e(\theta_2), e(\theta_3))|^2\, d\theta_2\, d\theta_3.$$

For all $\phi_2$, $\phi_3$, $\psi_2$, $\psi_3$ we have also shown that

$$\int_0^1\!\!\int_0^1 |v(e(\phi_1), e(\phi_2), e(\phi_3))|^2\, |w(e(\psi_1), e(\psi_2), e(\psi_3))|^2\, d\phi_1\, d\psi_1$$
$$\le f(m_1, k_1) \int_0^1 |v(e(\theta_1), e(\phi_2), e(\phi_3))|^2\, |w(e(\theta_1), e(\psi_2), e(\psi_3))|^2\, d\theta_1.$$

Integrate the former inequality with respect to $\theta_1$ and the latter with respect to $\phi_2$, $\phi_3$,
$\psi_2$, $\psi_3$. [This method was used by A. O. Gel’fond in Transcendental and Algebraic
Numbers (New York: Dover, 1960), Section 3.4, to derive a slightly different result.]

22. More generally, assume that $u(x) \equiv v(x)w(x)$ (modulo q), $a(x)v(x) + b(x)w(x) \equiv 1$
(modulo p), and $c \cdot \ell(v) \equiv 1$ (modulo r), deg(a) < deg(w), deg(b) < deg(v), deg(u) =
deg(v) + deg(w), where r = gcd(p, q) and p, q needn’t be prime. We shall construct
polynomials $V(x) \equiv v(x)$ and $W(x) \equiv w(x)$ (modulo q) such that $u(x) \equiv V(x)W(x)$
(modulo qr), $\ell(V) = \ell(v)$, deg(V) = deg(v), deg(W) = deg(w); furthermore, if r is
prime, the results will be unique modulo qr.

The problem asks us to find $\bar{v}(x)$ and $\bar{w}(x)$ with $V(x) = v(x) + q\bar{v}(x)$, $W(x) =
w(x) + q\bar{w}(x)$, $\deg(\bar{v}) < \deg(v)$, $\deg(\bar{w}) \le \deg(w)$; and the other condition

$$(v(x) + q\bar{v}(x))(w(x) + q\bar{w}(x)) \equiv u(x) \pmod{qr}$$

is equivalent to $\bar{w}(x)v(x) + \bar{v}(x)w(x) \equiv f(x)$ (modulo r), where f(x) satisfies $u(x) \equiv
v(x)w(x) + qf(x)$ (modulo qr). We have

$$(a(x)f(x) + t(x)w(x))v(x) + (b(x)f(x) - t(x)v(x))w(x) \equiv f(x) \pmod{r}$$

for all t(x). Since $\ell(v)$ has an inverse modulo r, we can find a quotient t(x) by
Algorithm 4.6.1D such that deg(bf − tv) < deg(v); for this t(x), $\deg(af + tw) \le \deg(w)$,
since we have $\deg(f) \le \deg(u) = \deg(v) + \deg(w)$. Thus the desired solution is $\bar{v}(x) =
b(x)f(x) - t(x)v(x) = b(x)f(x) \bmod v(x)$, $\bar{w}(x) = a(x)f(x) + t(x)w(x)$. If $(\bar{\bar{v}}(x), \bar{\bar{w}}(x))$ is
another solution, we have $(\bar{w}(x) - \bar{\bar{w}}(x))v(x) \equiv (\bar{\bar{v}}(x) - \bar{v}(x))w(x)$ (modulo r). Thus if
r is prime, v(x) must divide $\bar{\bar{v}}(x) - \bar{v}(x)$; but $\deg(\bar{\bar{v}} - \bar{v}) < \deg(v)$, so $\bar{\bar{v}}(x) = \bar{v}(x)$ and
$\bar{\bar{w}}(x) = \bar{w}(x)$.

For p = 2, the factorization proceeds as follows (writing only the coefficients,
and using bars for negative digits): Exercise 10 says that $v_1(x) = (1\,\bar1\,\bar1)$, $w_1(x) =
(1\,\bar1\,\bar1\,0\,0\,\bar1\,\bar1)$ in one-bit two’s complement notation. Euclid’s extended algorithm yields
a(x) = (1 0 0 0 0 1), b(x) = (1 0). The factor $v(x) = x^2 + c_1 x + c_0$ must have $|c_1| \le
\lfloor 1 + \sqrt{113} \rfloor = 11$, $|c_0| \le 10$, by exercise 20. Three applications of Hensel’s lemma
yield $v_4(x) = (1\,3\,\bar1)$, $w_4(x) = (1\,\bar3\,\bar5\,\bar4\,4\,\bar3\,5)$. Thus $c_1 \equiv 3$ and $c_0 \equiv -1$ (modulo 16);
the only possible quadratic factor of u(x) is $x^2 + 3x - 1$. Division fails, so u(x) is
irreducible. (Since we have now proved the irreducibility of this beloved polynomial
by four separate methods, it is unlikely that it has any factors.)
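
The three lifting steps of the p = 2 example can be reproduced with a direct transcription of the construction in this answer. The sketch below makes several assumptions: polynomials are Python lists with the constant term first, coefficients are kept as nonnegative residues instead of signed digits, v is monic, and all helper names are invented for the illustration. The polynomial u is (22), reconstructed from the coefficient vectors quoted above.

```python
def trim(a):
    while a and a[-1] == 0:
        a.pop()
    return a

def pmul(a, b, m=None):
    """Product of two polynomials; reduced mod m when m is given."""
    if not a or not b:
        return []
    c = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            c[i + j] += x * y
    return trim([x % m for x in c] if m else c)

def padd(a, b, m):
    n = max(len(a), len(b))
    a, b = a + [0] * (n - len(a)), b + [0] * (n - len(b))
    return trim([(x + y) % m for x, y in zip(a, b)])

def divmod_monic(a, v, m):
    """Quotient and remainder of a divided by the monic polynomial v, modulo m."""
    a = [x % m for x in a]
    quo = [0] * max(len(a) - len(v) + 1, 0)
    for i in range(len(a) - len(v), -1, -1):
        c = a[i + len(v) - 1]
        quo[i] = c
        for j, y in enumerate(v):
            a[i + j] = (a[i + j] - c * y) % m
    return trim(quo), trim(a)

def hensel_step(u, v, w, a, b, q, r):
    """Lift u = v*w (mod q) to u = V*W (mod q*r), given a*v + b*w = 1 (mod r)."""
    vw = pmul(v, w)
    n = max(len(u), len(vw))
    diff = [(u[i] if i < len(u) else 0) - (vw[i] if i < len(vw) else 0) for i in range(n)]
    f = trim([(d // q) % r for d in diff])            # u = v*w + q*f (mod q*r)
    t, v_bar = divmod_monic(pmul(b, f, r), v, r)      # v_bar = b*f mod v
    w_bar = padd(pmul(a, f, r), pmul(t, w, r), r)     # w_bar = a*f + t*w
    V = padd(v, [q * c for c in v_bar], q * r)
    W = padd(w, [q * c for c in w_bar], q * r)
    return V, W

u = [-5, 2, 8, -3, -3, 0, 1, 0, 1]           # x^8 + x^6 - 3x^4 - 3x^3 + 8x^2 + 2x - 5
v, w = [1, 1, 1], [1, 1, 0, 0, 1, 1, 1]      # factors modulo 2 from exercise 10
a, b = [1, 0, 0, 0, 0, 1], [0, 1]            # a(x) = x^5 + 1, b(x) = x
q = 2
for _ in range(3):
    v, w = hensel_step(u, v, w, a, b, q, 2)
    q *= 2
```

After the loop, v holds the residues of $x^2 + 3x - 1$ modulo 16 and w those of the cofactor; converting residues above 8 to negative digits gives the barred vectors in the text.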

Hans Zassenhaus has observed that we can often speed up such calculations by
increasing p as well as q: In the above notation, we can find A(x), B(x) such that
$A(x)V(x) + B(x)W(x) \equiv 1$ (modulo pr), namely by taking $A(x) = a(x) + p\bar{a}(x)$, $B(x) =
b(x) + p\bar{b}(x)$, where $\bar{a}(x)V(x) + \bar{b}(x)W(x) \equiv g(x)$ (modulo r), $a(x)V(x) + b(x)W(x) \equiv
1 - pg(x)$ (modulo pr). We can also find C with $\ell(V)C \equiv 1$ (modulo pr). In this
way we can lift a squarefree factorization $u(x) \equiv v(x)w(x)$ (modulo p) to its unique
extensions modulo $p^2$, $p^4$, $p^8$, $p^{16}$, etc. However, this “accelerated” procedure reaches a
point of diminishing returns in practice, as soon as we get to double-precision moduli,
since the time for multiplying multiprecision numbers in practical ranges outweighs the
advantage of squaring the modulus directly. From a computational standpoint it seems
best to work with the successive moduli p, $p^2$, $p^4$, $p^8$, ..., $p^E$, $p^{E+e}$, $p^{E+2e}$, $p^{E+3e}$,
..., where E is the smallest power of 2 with $p^E$ greater than single precision and e is
the largest integer such that $p^e$ has single precision.

Hensel’s lemma, which K. Hensel introduced in order to demonstrate the
factorization of polynomials over the field of p-adic numbers (see exercise 4.1-31), can be
generalized in several ways. First, if there are more factors, say $u(x) \equiv v_1(x)v_2(x)v_3(x)$
(modulo p), we can find $a_1(x)$, $a_2(x)$, $a_3(x)$ such that $a_1(x)v_2(x)v_3(x) + a_2(x)v_1(x)v_3(x) +
a_3(x)v_1(x)v_2(x) \equiv 1$ (modulo p) and $\deg(a_i) < \deg(v_i)$. (In essence, 1/u(x) is expanded
in partial fractions as $\sum a_i(x)/v_i(x)$.) An exactly analogous construction now allows
us to lift the factorization without changing the leading coefficients of $v_1$ and $v_2$; we
take $\bar{v}_1(x) = a_1(x)f(x) \bmod v_1(x)$, $\bar{v}_2(x) = a_2(x)f(x) \bmod v_2(x)$, etc. Another important
generalization is to several simultaneous moduli, of the respective forms $p^e$, $(x_2 - a_2)^{n_2}$,
..., $(x_t - a_t)^{n_t}$, when performing multivariate gcds and factorizations. Cf. D. Y. Y.
Yun, Ph.D. Thesis (M.I.T., 1974).

23. The discriminant of pp(u(x)) is a nonzero integer (cf. exercise 4.6.1-12), and there
are multiple factors modulo p iff p divides the discriminant. [The factorization of (22)
modulo 3 is $(x + 1)(x^2 - x - 1)^2(x^3 + x^2 - x + 1)$; squared factors for this polynomial
occur only for p = 3, 23, 233, and 121702457. It is not difficult to prove that the
smallest prime that is not unlucky is at most O(n log Nn), if n = deg(u) and N bounds
the coefficients of u(x).]

24. Multiply a monic polynomial with rational coefficients by a suitable nonzero 
integer, to get a primitive polynomial over the integers. Factor this polynomial over 
the integers, and then convert the factors back to monic. (No factorizations are lost 
in this way; see exercise 4.6.1-8.) 

25. Consideration of the constant term shows there are no factors of degree 1, so if
the polynomial is reducible, it must have one factor of degree 2 and one of degree 3.
Modulo 2 the factors are $x(x + 1)^2(x^2 + x + 1)$; this is not much help. Modulo 3 the
factors are $(x + 2)^2(x^3 + 2x + 2)$. Modulo 5 they are $(x^2 + x + 1)(x^3 + 4x + 2)$. So
we see that the answer is $(x^2 + x + 1)(x^3 - x + 2)$.
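
The four factorizations quoted in answer 25 can be multiplied out to confirm that they agree. The sketch below assumes the polynomial in question is the product of the stated integer factors, $x^5 + x^4 + x^2 + x + 2$; coefficient lists have the constant term first, and the helper names are illustrative.

```python
from functools import reduce

def pmul(a, b):
    """Product of two integer polynomials given as coefficient lists."""
    c = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            c[i + j] += x * y
    return c

def product_mod(factors, m):
    """Multiply the given factors and reduce every coefficient mod m."""
    return [c % m for c in reduce(pmul, factors)]

u = pmul([1, 1, 1], [2, -1, 0, 1])                      # (x^2 + x + 1)(x^3 - x + 2)
mod2 = product_mod([[0, 1], [1, 1], [1, 1], [1, 1, 1]], 2)
mod3 = product_mod([[2, 1], [2, 1], [2, 2, 0, 1]], 3)
mod5 = product_mod([[1, 1, 1], [2, 4, 0, 1]], 5)
```

Each of mod2, mod3, mod5 should coincide with u reduced by the corresponding modulus.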

26. Begin with $D \leftarrow (0 \ldots 01)$, representing the set {0}. Then for $1 \le j \le r$, set
$D \leftarrow D \lor (D \ll d_j)$, where $\lor$ denotes logical “or” and $D \ll d$ denotes D shifted left d bit
positions. (Actually we need only work with a bit vector of length $\lceil (n + 1)/2 \rceil$, since
n − m is in the set iff m is.)
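
With arbitrary-precision integers as bit vectors, the procedure of answer 26 is a two-line loop. This sketch uses a hypothetical function name and keeps the full-length vector rather than the half-length one mentioned in the parenthetical remark.

```python
def possible_degrees(factor_degrees):
    """Bit m of the result is 1 iff some subset of the given degrees sums to m."""
    D = 1                          # represents the set {0}
    for d in factor_degrees:
        D |= D << d
    return D
```

For the polynomial of answer 25, possible_degrees([1, 1, 3]) & possible_degrees([2, 3]) leaves only the bits for degrees 0, 2, 3, and 5, which matches the conclusion that any proper factors must have degrees 2 and 3.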

27. Exercise 4 says that a random polynomial of degree n is irreducible modulo p with
rather low probability, about 1/n. But the Chinese remainder theorem implies that a
random monic polynomial of degree n over the integers will be reducible with respect
to each of k distinct primes with probability about $(1 - 1/n)^k$, and this approaches zero
as $k \to \infty$. Hence almost all polynomials over the integers are irreducible with respect
to infinitely many primes; and almost all primitive polynomials over the integers are
irreducible. [Another proof has been given by W. S. Brown, AMM 70 (1963), 965-969.
See also the generalization cited in the answer to exercise 36.]

28. Cf. exercise 4; the probability is the coefficient of $z^n$ in
$(1 + a_{1p}z/p)(1 + a_{2p}z^2/p^2)(1 + a_{3p}z^3/p^3)\ldots$, which has the limiting value
$g(z) = (1 + z)(1 + \frac12 z^2)(1 + \frac13 z^3)\ldots$.
For $1 \le n \le 10$ the answers are 1, 1/2, 5/6, 7/12, 37/60, 79/120, 173/280, 101/168,
127/210, 1033/1680. [Let $f(y) = \ln(1 + y) - y = O(y^2)$. We have

$$g(z) = \exp\Bigl(\sum_{n \ge 1} z^n/n + \sum_{n \ge 1} f(z^n/n)\Bigr) = h(z)/(1 - z),$$

and it can be shown that the limiting probability is $h(1) = \exp(\sum_{n \ge 1} f(1/n)) =
e^{-\gamma} \approx .56146$ as $n \to \infty$; cf. D. H. Lehmer, Acta Arith. 21 (1972), 379-388. Indeed,
N. G. de Bruijn has established the asymptotic formula $\lim_{p \to \infty} a_{np} = e^{-\gamma} + e^{-\gamma}/n +
O(n^{-2} \log n)$.]
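
The limiting values for small n are the coefficients of $(1 + z)(1 + \frac12 z^2)(1 + \frac13 z^3)\ldots$, which can be expanded exactly with rational arithmetic. A minimal sketch follows, with an illustrative function name.

```python
from fractions import Fraction

def limiting_probabilities(n_max):
    """Coefficients of z^1 .. z^n_max in the product of (1 + z^k / k) for k >= 1."""
    coeffs = [Fraction(0)] * (n_max + 1)
    coeffs[0] = Fraction(1)
    for k in range(1, n_max + 1):
        # descending n, so the factor (1 + z^k / k) is applied exactly once
        for n in range(n_max, k - 1, -1):
            coeffs[n] += coeffs[n - k] / k
    return coeffs[1:]
```

Factors with k > n_max cannot affect the first n_max coefficients, so truncating the product is harmless.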

29. Let $q_1(x)$ and $q_2(x)$ be any two of the irreducible divisors of g(x). By the Chinese
remainder theorem (exercise 3), choosing a random polynomial t(x) of degree < 2d is
equivalent to choosing two random polynomials $t_1(x)$ and $t_2(x)$ of degrees < d, where
$t_i(x) = t(x) \bmod q_i(x)$. The gcd will be a proper factor if $t_1(x)^{(p^d-1)/2} \bmod q_1(x) = 1$
and $t_2(x)^{(p^d-1)/2} \bmod q_2(x) \ne 1$, or vice versa, and this condition holds for exactly
$2((p^d - 1)/2)((p^d + 1)/2) = (p^{2d} - 1)/2$ choices of $t_1(x)$ and $t_2(x)$.

Notes: We are considering here only the behavior with respect to two irreducible
factors, but the true behavior is probably much better. Suppose that each irreducible
factor $q_i(x)$ has probability $\frac12$ of dividing $t(x)^{(p^d-1)/2} - 1$ for each t(x), independent of
the behavior for other $q_j(x)$ and t(x); and assume that g(x) has r irreducible factors in
all. Then if we encode each $q_i(x)$ by a sequence of 0’s and 1’s according as $q_i(x)$ does or
doesn’t divide $t(x)^{(p^d-1)/2} - 1$ for the successive t’s tried, we obtain a random binary
trie with r lieves (cf. Section 6.3). The cost associated with an internal node of this
trie, having m lieves as descendants, is $O(m^2(\log p))$; and the solution to the recurrence
$A_n = \binom{n}{2} + 2^{1-n} \sum_k \binom{n}{k} A_k$ is $A_n = 2\binom{n}{2}$, by exercise 5.2.2-36. Hence the sum of costs
in the given random trie, which represents the expected time to factor g(x) completely,
is $O(r^2(\log p)^3)$ under this plausible assumption. The plausible assumption becomes
rigorously true if we choose t(x) at random of degree < rd instead of restricting it to
degree < 2d.

30. Let $T(x) = x + x^p + \cdots + x^{p^{d-1}}$ and $v(x) = T(t(x)) \bmod q(x)$. Since $t(x)^{p^d} = t(x)$
in the field of polynomial remainders modulo q(x), we have $v(x)^p = v(x)$ in that field;
in other words, v(x) is one of the p roots of the equation $y^p - y = 0$. Hence v(x) is an
integer.

It follows that $\prod_{0 \le s < p} \gcd(g_d(x), T(t(x)) - s) = g_d(x)$. In particular, when p = 2
we can argue as in exercise 29 that $\gcd(g_d(x), T(t(x)))$ will be a proper factor of $g_d(x)$
with probability $\ge \frac12$ when $g_d(x)$ has at least two irreducible factors and t(x) is a
random binary polynomial of degree < 2d.

[Note that T(t(x)) mod g(x) can be computed by starting with $u(x) \leftarrow t(x)$ and
setting $u(x) \leftarrow (t(x) + u(x)^p) \bmod g(x)$ repeatedly, d − 1 times. The method of this
exercise is based on the polynomial factorization $x^{p^d} - x = \prod_{0 \le s < p} (T(x) - s)$, which
holds for any p, while formula (21) is based on the polynomial factorization
$x^{p^d} - x = x(x^{(p^d-1)/2} + 1)(x^{(p^d-1)/2} - 1)$ for odd p.]

31. If $\alpha$ is an element of the field of $p^d$ elements, let $d(\alpha)$ be the “degree” of $\alpha$, namely
the smallest exponent e such that $\alpha^{p^e} = \alpha$. Then

$$P_\alpha(x) = (x - \alpha)(x - \alpha^p) \ldots (x - \alpha^{p^{d-1}}) = q_\alpha(x)^{d/d(\alpha)},$$

where $q_\alpha(x)$ is an irreducible polynomial of degree $d(\alpha)$. As $\alpha$ runs through all elements
of the field, the corresponding $q_\alpha(x)$ runs through every irreducible polynomial of
degree e dividing d, where every such irreducible occurs exactly e times. We have
$(x + t)^{(p^d-1)/2} \bmod q_\alpha(x) = 1$ if and only if $(\alpha + t)^{(p^d-1)/2} = 1$ in the field. If t is an
integer, we have $d(\alpha + t) = d(\alpha)$, hence n(p, d) is $d^{-1}$ times the number of elements
$\alpha$ of degree d such that $\alpha^{(p^d-1)/2} = 1$. Similarly, if $t_1 \ne t_2$ we want to count the
number of elements of degree d such that $(\alpha + t_1)^{(p^d-1)/2} = (\alpha + t_2)^{(p^d-1)/2}$, i.e.,
$((\alpha + t_1)/(\alpha + t_2))^{(p^d-1)/2} = 1$. As $\alpha$ runs through all the elements of degree d, so
does the quantity $(\alpha + t_1)/(\alpha + t_2) = 1 + (t_1 - t_2)/(\alpha + t_2)$.

[We have $n(p, d) = \frac14 d^{-1} \sum_{c \backslash d} (3 + (-1)^c)\mu(c)(p^{d/c} - 1)$, which is about half the
total number of irreducibles (exactly half, in fact, when d is odd). This proves that
$\gcd(g_d(x), (x + t)^{(p^d-1)/2} - 1)$ has a good chance of finding factors of $g_d(x)$ when t is
fixed and $g_d(x)$ is chosen at random; but a probabilistic algorithm is supposed to work
with guaranteed probability for fixed $g_d(x)$ and random t, as in exercise 29.]

32. (a) Clearly $x^n - 1 = \prod_{d \backslash n} \Psi_d(x)$, since every complex nth root of unity is a
primitive dth root for some unique $d \backslash n$. The second identity follows from the first;
and $\Psi_n(x)$ has integer coefficients since it is expressed in terms of products and quotients
of monic polynomials with integer coefficients. (b) The condition in the hint suffices to
prove that $f(x) = \Psi_n(x)$, so we shall take the hint. When p does not divide n, we have
$\gcd(x^n - 1, nx^{n-1}) = 1$ modulo p, hence $x^n - 1$ is squarefree modulo p. Given f(x)
and $\zeta$ as in the hint, let g(x) be the irreducible factor of $\Psi_n(x)$ such that $g(\zeta^p) = 0$. If
$g(x) \ne f(x)$ then both f(x) and g(x) are distinct factors of $\Psi_n(x)$, hence they are distinct
factors of $x^n - 1$, hence they have no irreducible factors in common modulo p. However,
$\zeta$ is a root of $g(x^p)$, so $\gcd(f(x), g(x^p)) \ne 1$ over the integers, hence f(x) is a divisor of
$g(x^p)$. By (5), f(x) is a divisor of $g(x)^p$, modulo p, contradicting the assumption that
f(x) and g(x) have no irreducible factors in common. Therefore f(x) = g(x). [The
irreducibility of $\Psi_n(x)$ was first proved for prime n by K. F. Gauss in Disquisitiones
Arithmeticae (Leipzig, 1801), Art. 341, and for general n by L. Kronecker, J. de Math.
Pures et Appliquées 19 (1854), 177-192.]

(c) $\Psi_1(x) = x - 1$; and when p is prime, $\Psi_p(x) = 1 + x + \cdots + x^{p-1}$. If n > 1
is odd, it is not difficult to prove that $\Psi_{2n}(x) = \Psi_n(-x)$. If p divides n, the second
identity in (a) shows that $\Psi_{pn}(x) = \Psi_n(x^p)$. If p does not divide n, we have $\Psi_{pn}(x) =
\Psi_n(x^p)/\Psi_n(x)$. For nonprime $n \le 15$ we have $\Psi_4(x) = x^2 + 1$, $\Psi_6(x) = x^2 - x + 1$,
$\Psi_8(x) = x^4 + 1$, $\Psi_9(x) = x^6 + x^3 + 1$, $\Psi_{10}(x) = x^4 - x^3 + x^2 - x + 1$, $\Psi_{12}(x) =
x^4 - x^2 + 1$, $\Psi_{14}(x) = x^6 - x^5 + x^4 - x^3 + x^2 - x + 1$, $\Psi_{15}(x) = x^8 - x^7 + x^5 -
x^4 + x^3 - x + 1$. [The formula $\Psi_{pq}(x) = (1 + x^p + \cdots + x^{(q-1)p})(x - 1)/(x^q - 1)$
can be used to show that $\Psi_{pq}(x)$ has all coefficients ±1 or 0 when p and q are prime;
but the coefficients of $\Psi_{pqr}(x)$ can be arbitrarily large.]
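
The first identity of part (a) gives a simple way to tabulate cyclotomic polynomials: divide $x^n - 1$ by $\Psi_d(x)$ for every proper divisor d of n. The following sketch uses exact integer division of coefficient lists (constant term first); the function names and the memo dictionary are assumptions made for the illustration.

```python
def divide_exact(a, b):
    """Quotient a / b for integer polynomials with b monic; the division must be exact."""
    a = list(a)
    quo = [0] * (len(a) - len(b) + 1)
    for i in range(len(a) - len(b), -1, -1):
        c = a[i + len(b) - 1]
        quo[i] = c
        for j, y in enumerate(b):
            a[i + j] -= c * y
    assert not any(a), "division was not exact"
    return quo

_memo = {}

def cyclotomic(n):
    """Coefficients of the nth cyclotomic polynomial, constant term first."""
    if n not in _memo:
        poly = [-1] + [0] * (n - 1) + [1]             # x^n - 1
        for d in range(1, n):
            if n % d == 0:
                poly = divide_exact(poly, cyclotomic(d))
        _memo[n] = poly
    return _memo[n]
```

The values for n = 4, 6, 8, 9, 10, 12, 14, and 15 can be compared against the list in part (c).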

33. False; we lose all $p_j$ with $e_j$ divisible by p. True if p > deg(u). [See exercise 36.]

34. [D. Y. Y. Yun, Proc. ACM Symp. Symbolic and Algebraic Comp. (1976), 26-35.]
Set $(t(x), v_1(x), w_1(x)) \leftarrow \mathrm{GCD}(u(x), u'(x))$. If t(x) = 1, set $e \leftarrow 1$; otherwise set
$(u_i(x), v_{i+1}(x), w_{i+1}(x)) \leftarrow \mathrm{GCD}(v_i(x), w_i(x) - v_i'(x))$ for i = 1, 2, ..., e − 1, until
finding $w_e(x) - v_e'(x) = 0$. Finally set $u_e(x) \leftarrow v_e(x)$.

To prove the validity of this algorithm, we observe that it computes the
polynomials $t(x) = u_2(x)u_3(x)^2u_4(x)^3\ldots$, $v_i(x) = u_i(x)u_{i+1}(x)u_{i+2}(x)\ldots$, and

$$w_i(x) = u_i'(x)u_{i+1}(x)u_{i+2}(x)\ldots + 2u_i(x)u_{i+1}'(x)u_{i+2}(x)\ldots + 3u_i(x)u_{i+1}(x)u_{i+2}'(x)\ldots + \cdots.$$

We have $\gcd(t(x), w_1(x)) = 1$, since an irreducible factor of $u_i(x)$ divides all but the ith
term of $w_1(x)$, and it is relatively prime to that term. Furthermore $\gcd(u_i(x), v_{i+1}(x))$
is clearly 1.

[Exercise 2(b) indicates that comparatively few polynomials are squarefree, but 
non-squarefree polynomials actually occur often in practice; hence this method turns 
out to be quite important. See Paul S. Wang and Barry M. Trager, SIAM J. Computing 
8 (1979), 300-305, for suggestions on how to improve the efficiency when the given 
polynomial is already squarefree.] 
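
The iteration of answer 34 can be sketched over the rationals as follows. This is an illustration under stated assumptions: coefficient lists have the constant term first, the input is monic, gcds are computed by the plain Euclidean algorithm with exact fractions (not by the more careful methods of Section 4.6.1), and all helper names are invented here.

```python
from fractions import Fraction

def trim(a):
    while a and a[-1] == 0:
        a.pop()
    return a

def deriv(a):
    return trim([i * c for i, c in enumerate(a)][1:])

def psub(a, b):
    n = max(len(a), len(b))
    a, b = a + [0] * (n - len(a)), b + [0] * (n - len(b))
    return trim([x - y for x, y in zip(a, b)])

def pdivmod(a, b):
    """Quotient and remainder of a by b over the rationals."""
    a = list(a)
    quo = [Fraction(0)] * max(len(a) - len(b) + 1, 0)
    for i in range(len(a) - len(b), -1, -1):
        c = a[i + len(b) - 1] / b[-1]
        quo[i] = c
        for j, y in enumerate(b):
            a[i + j] -= c * y
    return trim(quo), trim(a)

def pgcd(a, b):
    """Monic greatest common divisor."""
    while b:
        a, b = b, pdivmod(a, b)[1]
    return [c / a[-1] for c in a]

def yun(u):
    """Return [u_1, u_2, ..., u_e] with u = u_1 * u_2^2 * ... * u_e^e."""
    u = [Fraction(c) for c in u]
    du = deriv(u)
    t = pgcd(u, du)
    v, w = pdivmod(u, t)[0], pdivmod(du, t)[0]       # v_1 and w_1
    parts = []
    while True:
        d = psub(w, deriv(v))                        # w_i - v_i'
        if not d:
            parts.append(v)                          # u_e = v_e
            return parts
        g = pgcd(v, d)                               # u_i
        parts.append(g)
        v, w = pdivmod(v, g)[0], pdivmod(d, g)[0]    # v_{i+1} and w_{i+1}

u = [0, 0, 0, 4, 0, -3, 1]      # (x + 1)(x - 2)^2 x^3 = x^6 - 3x^5 + 4x^3
```

Here yun(u) yields the three factors x + 1, x − 2, and x in that order; a factor $u_i = 1$ shows up as the constant polynomial [1].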

35. We have $w_j(x) = \gcd(u_j(x), v_j^*(x)) \cdot \gcd(u_{j+1}^*(x), v_j(x))$, where

$$u_j^*(x) = u_j(x)u_{j+1}(x)\ldots \quad\text{and}\quad v_j^*(x) = v_j(x)v_{j+1}(x)\ldots.$$

[Yun notes that the running time for squarefree factorization by the method of exercise
34 is at most about twice the running time to calculate gcd(u(x), u′(x)). Furthermore if
we are given an arbitrary method for discovering squarefree factorization, the method
of this exercise leads to a gcd procedure. (When u(x) and v(x) are squarefree, their
gcd is simply $w_2(x)$ where $w(x) = u(x)v(x) = w_1(x)w_2(x)^2$; the polynomials $u_j(x)$,
$v_j(x)$, $u_j^*(x)$, and $v_j^*(x)$ are all squarefree.) Hence the problem of converting a primitive
polynomial of degree n to its squarefree representation is computationally equivalent
to the problem of calculating the gcd of two nth degree polynomials, in the sense of
asymptotic worst-case running time.]

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36. Let Uj{x) be the value computed for u Uj(x)” by the procedure of exercise 34. If 

deg(f/i)-f 2deg(f/ 2 )H-- deg(u), then Uj(x ) = Uj(x) for all j. But in general we will 

have e < p and Uj(x) = J^ fc>0 Uj+ P k(x) for 1 < j < p. To separate these factors fur¬ 
ther, we can calculate t(x)/(U 2 (x)Us(x) 2 . .. U p —i(x ) p ~ 2 ) — Y\j >p u j{ x ) p ^^ p ^ = z{x p ). 
After recursively finding the squarefree representation of z(x) = (zi(x), Z 2 (x ),...), we 
will have Zk{x ) = Jlo < J<P U 3 +P k ( x )> so we can calculate the individual m(x) by the 
formula gcd(Uj(x), Zk(x)) = u j+pk {x) for 1 < j < p. The polynomial u p k{x) will be 
left when the other factors of Zk(x) have been removed. 

Note: This procedure is fairly simple but the program is lengthy. If one’s goal is 
to have a short program for complete factorization modulo p, rather than an extremely 
efficient one, it is probably easiest to modify the distinct-degree factorization routine 
so that it casts out $\gcd(x^{p^d} - x, u(x))$ several times for the same value of $d$ until the
gcd is 1. In this case you needn’t begin by calculating $\gcd(u(x), u'(x))$ and removing
multiple factors as suggested in the text, since the polynomial $x^{p^d} - x$ is squarefree.

37. The exact probability is $\prod_{j \ge 1} (a_{jp}/p^j)^{k_j}/k_j!$, where $k_j$ is the number of $d_i$ that are
equal to $j$. Since $a_{jp}/p^j \approx 1/j$ by exercise 4, we get the formula of exercise 1.3.3-21.

Notes: This exercise says that if we fix the prime $p$ and let the polynomial $u(x)$ be
random, it will have a certain probability of splitting in a given way modulo $p$. A much
harder problem is to fix the polynomial $u(x)$ and to let $p$ be “random”; it turns out that
the same asymptotic result holds for almost all $u(x)$. G. Frobenius proved in 1880 that
the integer polynomial $u(x)$ splits modulo $p$ into factors of degrees $d_1, \ldots, d_r$, when $p$ is a
large prime chosen at random, with probability equal to the number of permutations in
the Galois group $G$ of $u(x)$ having cycle lengths $\{d_1, \ldots, d_r\}$ divided by the total number
of permutations in $G$. (If $u(x)$ has rational coefficients and distinct roots $\xi_1, \ldots, \xi_n$
over the complex numbers, its Galois group is the (unique) group $G$ of permutations
such that the polynomial $\prod_{p(1)\ldots p(n) \in G} (z + \xi_{p(1)}y_1 + \cdots + \xi_{p(n)}y_n) = U(z, y_1, \ldots, y_n)$
has rational coefficients and is irreducible over the rationals.) Furthermore B. L.
van der Waerden proved in 1934 that almost all polynomials of degree n have the 
set of all n! permutations as their Galois group. Therefore almost all fixed irreducible 
polynomials u(x) will factor as we might expect them to, with respect to randomly 
chosen large primes $p$. References: Sitzungsberichte Königl. Preuß. Akad. Wiss. (Berlin:
1896), 689-703; Math. Annalen 109 (1934), 13-16. See also N. Chebotarev, Math. 
Annalen 95 (1926), for a generalization of Frobenius’s theorem to conjugacy classes of 
the Galois group. 

38. (Partial solution by Peter Weinberger.) The average number of 1-cycles in a 
randomly chosen element of any transitive permutation group G on n objects is 1, since 
the probability is $1/n$ that any given object is fixed. Since Galois groups are always
transitive, the remarks in the previous answer show that a fixed irreducible polynomial
has exactly one linear factor modulo $p$, on the average, as $p \to \infty$. Thus, the average
number of linear factors of u(x) modulo p is the number of irreducible factors of u(x) 
over the integers. 

Weinberger [Proc. Symp. Pure Math. 24 (Amer. Math. Soc., 1973), 321-332] has 
proved that if the generalized Riemann hypothesis (GRH) holds (this is a conjecture
about the zeros of a generalized zeta function), then there is an absolute constant $A_1$
with the following property: The number of prime ideals of norm $\le x$, in any algebraic
number field of degree $n$, differs from $\int_2^x dt/\ln t$ by at most $A_1 n x^{1/2} \ln(x\Delta)$, where $\Delta$ is
the absolute value of the discriminant of the irreducible polynomial $u$ that defines the
field. The number of prime ideals of norm $\le x$ is at most $n x^{1/2}$ different from the total
number of linear factors of $u$ modulo primes $\le x$, since such linear factors correspond
to ideals of norm $p$, while other prime ideals have norms $\ge p^2$. Consequently if we let
$N(x)$ be the total number of linear factors of a given primitive polynomial $u$, modulo
all primes $\le x$, then the GRH implies that there is an absolute constant $A_2$ such that
$|N(x)/\pi(x) - r| \le A_2 n x^{-1/2} (\ln x)(\ln x\Delta)$. A proof of GRH would yield a “short” proof
of the number of irreducible factors of any primitive polynomial $u$ over the integers,
since we could evaluate $N(x)$ for a value of $x$ sufficiently large to make this error bound
less than $\frac12$. Unfortunately $A_2$ is quite large.

SECTION 4.6.3 

1. $x^m$, where $m = 2^{\lambda(n)}$, the highest power of 2 less than or equal to $n$.

2. Assume that x is input in register A, and n in location NN; the output is in 
register X. 

01  A1  ENTX  1     1        A1. Initialize.
02      STX   Y     1        Y ← 1.
03      STA   Z     1        Z ← x.
04      LDA   NN    1        N ← n.
05      JMP   2F    1        To A2.
06  5H  SRB   1     L+1-K
07      STA   N     L+1-K    N ← ⌊N/2⌋.
08  A5  LDA   Z     L        A5. Square Z.
09      MUL   Z     L        Z × Z mod w
10      STX   Z     L          → Z.
11  A2  LDA   N     L        A2. Halve N.
12  2H  JAE   5B    L+1      To A5 if N is even.
13      SRB   1     K
14  A4  JAZ   4F    K        Jump if N = 1.
15      STA   N     K-1      N ← ⌊N/2⌋.
16  A3  LDA   Z     K-1      A3. Multiply Y by Z.
17      MUL   Y     K-1      Z × Y mod w
18      STX   Y     K-1        → Y.
19      JMP   A5    K-1      To A5.
20  4H  LDA   Z     1
21      MUL   Y     1        Do the final multiplication. ∎

[It would be better programming practice to change the instruction in line 05 to “JAP”,
followed by an error indication. The running time is $21L + 16K + 8$, where $L = \lambda(n)$
is one less than the number of bits in the binary representation of $n$, and $K = \nu(n)$ is
the number of 1 bits in that representation.]

For the serial program, we may assume that n is small enough to fit in an index 
register; otherwise serial exponentiation is out of the question. The following program 
leaves the output in register A:

01  S1  LD1   NN    1        rI1 ← n.
02      STA   X     1        X ← x.
03      JMP   2F    1
04  1H  MUL   X     N-1      rA × X mod w
05      SLAX  5     N-1        → rA.
06  2H  DEC1  1     N        rI1 ← rI1 - 1.
07      J1P   1B    N        Multiply again if rI1 > 0. ∎

The running time for this program is $14N - 7$; it is faster than the previous program
when $n \le 7$, slower when $n \ge 8$.
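For readers who do not read MIX, the control flow of the first program above can be sketched in Python. This is only an illustration under stated assumptions: it uses Python's unbounded integers rather than arithmetic mod $w$, and the function name and the multiplication counter are hypothetical additions, not part of the original answer.

```python
def power_right_to_left(x, n):
    """Right-to-left binary exponentiation; returns (x**n, multiplications used)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    y, z, big_n = 1, x, n          # Y <- 1, Z <- x, N <- n
    mults = 0
    while True:
        odd = big_n & 1            # test the low bit of N
        big_n >>= 1                # N <- floor(N/2)
        if odd:
            y *= z                 # multiply Y by Z
            mults += 1
            if big_n == 0:         # that was the final multiplication
                return y, mults
        z *= z                     # square Z
        mults += 1
```

With this counting, the loop performs $\lambda(n)$ squarings and $\nu(n)$ multiplications of Y by Z, the first of which multiplies by 1 just as the MIX program does.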

3. The sequences of exponents are: (a) 1, 2, 3, 6, 7, 14, 15, 30, 60, 120, 121, 242,
243, 486, 487, 974, 975 [16 multiplications]; (b) 1, 2, 3, 4, 8, 12, 24, 36, 72, 108,
216, 324, 325, 650, 975 [14 multiplications]; (c) 1, 2, 3, 6, 12, 15, 30, 60, 120, 240,
243, 486, 972, 975 [13 multiplications]; (d) 1, 2, 3, 6, 12, 15, 30, 60, 75, 150, 300,
600, 900, 975 [13 multiplications]. [The smallest possible number of multiplications is
12; this is obtainable by combining the factor method with the binary method, since
$975 = 15 \cdot (2^6 + 1)$.]
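The four sequences can be checked mechanically. The following Python sketch (the helper name is hypothetical) tests only that every element after the first is a sum of two earlier elements, not that the chain is the one a particular method would produce.

```python
def is_addition_chain(seq):
    """True if seq starts with 1 and each later term is a sum of two earlier terms."""
    if not seq or seq[0] != 1:
        return False
    for i in range(1, len(seq)):
        earlier = seq[:i]
        seen = set(earlier)
        # a term may use the same earlier element twice (a doubling)
        if not any((seq[i] - a) in seen for a in earlier):
            return False
    return True

chain_a = [1, 2, 3, 6, 7, 14, 15, 30, 60, 120, 121, 242, 243, 486, 487, 974, 975]
chain_b = [1, 2, 3, 4, 8, 12, 24, 36, 72, 108, 216, 324, 325, 650, 975]
chain_c = [1, 2, 3, 6, 12, 15, 30, 60, 120, 240, 243, 486, 972, 975]
chain_d = [1, 2, 3, 6, 12, 15, 30, 60, 75, 150, 300, 600, 900, 975]
```

The number of multiplications is one less than the length of each list.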

4. $(777777)_8 = 2^{18} - 1$.

5. T1. [Initialize.] Set LINKU[$j$] $\gets 0$ for $1 \le j \le 2^r$, and set $k \gets 0$, LINKR[0] $\gets 1$,
LINKR[1] $\gets 0$.

T2. [Change level.] (Now level $k$ of the tree has been linked together from left to
right, starting at LINKR[0].) If $k = r$, the algorithm terminates. Otherwise
set $n \gets$ LINKR[0], $m \gets 0$.

T3. [Prepare for $n$.] (Now $n$ is a node on level $k$, and $m$ points to the rightmost
node currently on level $k + 1$.) Set $q \gets 0$, $s \gets n$.

T4. [Already in tree?] (Now $s$ is a node in the path from the root to $n$.) If
LINKU[$n + s$] $\ne 0$, go to T6 (the value $n + s$ is already in the tree).

T5. [Insert below $n$.] If $q = 0$, set $m' \gets n + s$. Then set LINKR[$n + s$] $\gets q$,
LINKU[$n + s$] $\gets n$, $q \gets n + s$.

T6. [Move up.] Set $s \gets$ LINKU[$s$]. If $s \ne 0$, return to T4.

T7. [Attach group.] If $q \ne 0$, set LINKR[$m$] $\gets q$, $m \gets m'$.

T8. [Move $n$.] Set $n \gets$ LINKR[$n$]. If $n \ne 0$, return to T3.

T9. [End of level.] Set LINKR[$m$] $\gets 0$, $k \gets k + 1$, and return to T2. ∎
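Steps T1 through T9 translate almost line for line into Python. The sketch below keeps the LINKU and LINKR tables as lists; the function name and the extra list of levels it returns are assumptions made for illustration.

```python
def power_tree_levels(r):
    """Build levels 0..r of the power tree; returns (levels, linku)."""
    size = 2 ** r
    linku = [0] * (size + 1)        # LINKU[j]: node above j (0 if j is absent or the root)
    linkr = [0] * (size + 1)        # LINKR[j]: right neighbour of j on its level
    linkr[0], linkr[1] = 1, 0       # T1
    levels = [[1]]
    for _ in range(r):              # T2: one pass per level
        n, m = linkr[0], 0
        while n != 0:
            q, s, m_new = 0, n, m   # T3
            while s != 0:
                if linku[n + s] == 0:        # T4: n + s not yet in the tree
                    if q == 0:
                        m_new = n + s        # T5: rightmost node of this group
                    linkr[n + s] = q
                    linku[n + s] = n
                    q = n + s
                s = linku[s]                 # T6
            if q != 0:                       # T7
                linkr[m] = q
                m = m_new
            n = linkr[n]                     # T8
        linkr[m] = 0                         # T9
        level, j = [], linkr[0]
        while j != 0:                        # read the new level from left to right
            level.append(j)
            j = linkr[j]
        levels.append(level)
    return levels, linku
```

For example, `power_tree_levels(5)` yields the level `[7, 10, 9, 12, 16]` at depth 4, and `linku[7]` is 5.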

6. Prove by induction that the path to the number $2^{e_0} + 2^{e_1} + \cdots + 2^{e_t}$, if $e_0 >
e_1 > \cdots > e_t \ge 0$, is 1, 2, $2^2$, ..., $2^{e_0}$, $2^{e_0} + 2^{e_1}$, ..., $2^{e_0} + 2^{e_1} + \cdots + 2^{e_t}$; furthermore,
the sequences of exponents on each level are in decreasing lexicographic order.

7. The binary and factor methods require one more step to compute $x^{2n}$ than $x^n$;
the power tree method requires at most one more step. Hence (a) $15 \cdot 2^k$; (b) $33 \cdot 2^k$;
(c) $23 \cdot 2^k$; $k = 0, 1, 2, 3, \ldots\,$.

8. The power tree always includes the node 2m at one level below m, unless it occurs 
at the same level or an earlier level; and it always includes the node $2m + 1$ at one
level below 2m, unless it occurs at the same level or an earlier level. [It is not true 
that 2m is a son of m in the power tree for all m; the smallest example where this fails 
is m = 2138, which appears on level 15, while 4276 appears elsewhere on level 16. In 
fact, 2m sometimes occurs on the same level as m; the smallest example is m = 6029.] 

9. Start with $N \gets n$, $Z \gets x$, and $Y_k \gets 1$ for $1 \le k < m$; in general we will have
$x^n = Y_1 Y_2^2 \ldots Y_{m-1}^{m-1} Z^N$. If $N > 0$, set $k \gets N \bmod m$, and if $k \ne 0$ set $Y_k \gets Y_k \cdot Z$.
Then set $Z \gets Z^m$, $N \gets \lfloor N/m \rfloor$, and repeat. Finally set $Y_k \gets Y_k \cdot Y_{k+1}$ for $k = m - 2$,
$m - 3$, ..., 1; the answer is $Y_1 \ldots Y_{m-1}$.
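The procedure just described can be sketched in Python as follows. The function name is hypothetical, and `z ** m` stands in for whatever method is used to raise $Z$ to the $m$th power; the sketch makes no attempt to count or minimize multiplications.

```python
def power_m_ary_right_to_left(x, n, m):
    """Right-to-left m-ary exponentiation; invariant x**n == Y_1 * Y_2**2 * ... * Z**N."""
    ys = [1] * m                  # ys[k] plays the role of Y_k for 1 <= k < m; ys[0] is unused
    z, big_n = x, n
    while big_n > 0:
        k = big_n % m
        if k != 0:
            ys[k] *= z
        big_n //= m
        if big_n > 0:
            z = z ** m            # Z <- Z^m (placeholder for a real powering routine)
    # Turn each Y_k into the product Y_k * Y_{k+1} * ... * Y_{m-1}.
    for k in range(m - 2, 0, -1):
        ys[k] *= ys[k + 1]
    result = 1
    for k in range(1, m):
        result *= ys[k]           # the product of the suffix products is Y_1 * Y_2^2 * ...
    return result
```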
10. By using the “FATHER” representation discussed in Section 2.3.3: Make use of a
table $f[j]$, $1 \le j \le 100$, such that $f[1] = 0$ and $f[j]$ is the number of the node just
above $j$ for $j \ge 2$. (The fact that each node of this tree has degree at most two has
no effect on the efficiency of this representation; it just makes the tree look prettier as 
an illustration.) 

11. 1, 2, 3, 5, 10, 20, (23 or 40), 43; 1, 2, 4, 8, 9, 17, (26 or 34), 43; 1, 2, 4, 8, 9, 17, 
34, (43 or 68), 77; 1, 2, 4, 5, 9, 18, 36, (41 or 72), 77. If either of the latter two paths 
were in the tree we would have no possibility for n = 43, since the tree must contain 
either 1, 2, 3, 5 or 1, 2, 4, 8, 9. 

12. No such infinite tree can exist, since $l(n) \ne l^*(n)$ for some $n$.

13. For Case 1, use a Type-1 chain followed by $2^{A+C} + 2^{B+C} + 2^A + 2^B$; or use the
factor method. For Case 2, use a Type-2 chain followed by $2^{A+C+1} + 2^{B+C} + 2^A + 2^B$.
For Case 3, use a Type-5 chain followed by addition of $2^A + 2^{A-1}$, or use the factor
method. For Case 4, $n = 135 \cdot 2^D$, so we may use the factor method.

14. (a) It is easy to verify that steps $r - 1$ and $r - 2$ are not both small, so let us
assume that step $r - 1$ is small and step $r - 2$ is not. If $c = 1$, then $\lambda(a_{r-1}) = \lambda(a_{r-k})$,
so $k = 2$; and since $4 \le \nu(a_r) = \nu(a_{r-1}) + \nu(a_{r-k}) - 1 \le \nu(a_{r-1}) + 1$, we have
$\nu(a_{r-1}) \ge 3$, making $r - 1$ a star step (lest $a_0$, $a_1$, ..., $a_{r-3}$, $a_{r-1}$ include only one
small step). Then $a_{r-1} = a_{r-2} + a_{r-q}$ for some $q$, and if we replace $a_{r-2}$, $a_{r-1}$,
$a_r$ by $a_{r-2}$, $2a_{r-2}$, $2a_{r-2} + a_{r-q} = a_r$, we obtain another counterexample chain
in which step $r$ is small; but this is impossible. On the other hand, if $c \ge 2$, then
$4 \le \nu(a_r) \le \nu(a_{r-1}) + \nu(a_{r-k}) - 2 \le \nu(a_{r-1})$; hence $\nu(a_{r-1}) = 4$, $\nu(a_{r-k}) = 2$,
and $c = 2$. This leads readily to an impossible situation by a consideration of the six
types in the proof of Theorem B.

(b) If $\lambda(a_{r-k}) < m - 1$, we have $c \ge 3$, so $\nu(a_{r-k}) + \nu(a_{r-1}) \ge 7$ by (22);
therefore both $\nu(a_{r-k})$ and $\nu(a_{r-1})$ are $\ge 3$. All small steps must be $\le r - k$, and
$\lambda(a_{r-k}) = m - k + 1$. If $k \ge 4$, we must have $c = 4$, $k = 4$, $\nu(a_{r-1}) = \nu(a_{r-4}) = 4$;
thus $a_{r-1} \ge 2^m + 2^{m-1} + 2^{m-2}$, and $a_{r-1}$ must equal $2^m + 2^{m-1} + 2^{m-2} + 2^{m-3}$; but
$a_{r-4} \ge \frac18 a_{r-1}$ now implies that $a_{r-1} = 8a_{r-4}$. Thus $k = 3$ and $a_{r-1} > 2^m + 2^{m-1}$.
Since $a_{r-2} < 2^m$ and $a_{r-3} < 2^{m-1}$, step $r - 1$ must be a doubling; but step $r - 2$ is
a nondoubling, since $a_{r-1} \ne 4a_{r-3}$. Furthermore, since $\nu(a_{r-3}) \ge 3$, $r - 3$ is a star
step; and $a_{r-2} = a_{r-3} + a_{r-5}$ would imply that $a_{r-5} = 2^{m-2}$, hence we must have
$a_{r-2} = a_{r-3} + a_{r-4}$. As in a similar case treated in the text, the only possibility is
now seen to be $a_{r-4} = 2^{m-2} + 2^{m-3}$, $a_{r-3} = 2^{m-2} + 2^{m-3} + 2^{d+1} + 2^d$, $a_{r-1} =
2^m + 2^{m-1} + 2^{d+2} + 2^{d+1}$, and even this possibility is impossible.

16. $l^B(n) = \lambda(n) + \nu(n) - 1$; so if $n = 2^k$, $l^B(n)/\lambda(n) = 1$, but if $n = 2^{k+1} - 1$,
$l^B(n)/\lambda(n) = 2$.
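A short Python check of the two extreme cases, assuming $\lambda(n) = \lfloor \lg n \rfloor$ and $\nu(n)$ = number of 1 bits as in the text (the function name is hypothetical):

```python
def binary_method_length(n):
    """l^B(n) = lambda(n) + nu(n) - 1 for a positive integer n."""
    lam = n.bit_length() - 1        # lambda(n) = floor(lg n)
    nu = bin(n).count("1")          # nu(n) = number of 1 bits
    return lam + nu - 1
```

For $n = 2^4$ this gives 4, equal to $\lambda(n)$; for $n = 2^5 - 1$ it gives 8, twice $\lambda(n) = 4$.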

17. Let $i_1 < \cdots < i_t$. Delete any intervals $I_k$ that can be removed without affecting
the union $I_1 \cup \cdots \cup I_t$. (The interval $(j_k, i_k]$ may be dropped out if either $j_{k+1} \le j_k$
or $j_1 < j_2 < \cdots$ and $j_{k+1} \le i_{k-1}$.) Now combine overlapping intervals $(j_1, i_1]$, ...,
$(j_d, i_d]$ into an interval $(j', i'] = (j_1, i_d]$ and note that

$a_{i'} < a_{j'}(1 + \delta)^{i_1 - j_1 + \cdots + i_d - j_d} \le a_{j'}(1 + \delta)^{2(i' - j')}$,

since each point of $(j', i']$ is covered at most twice in $(j_1, i_1] \cup \cdots \cup (j_d, i_d]$.
18. Call $f(m)$ a “nice” function if $(\log f(m))/m \to 0$ as $m \to \infty$. A polynomial in $m$
is nice. The product of nice functions is nice. If $g(m) \to 0$ and $c$ is a positive constant,
then $c^{m g(m)}$ is nice; also $\binom{2m}{m g(m)}$ is nice, for by Stirling’s approximation this is equivalent
to saying that $g(m)\log(1/g(m)) \to 0$.

Now replace each term of the summation by the maximum term that is attained
for any $s$, $t$, $v$. The total number of terms is nice, and so are $\binom{m+s}{t+v}$, $\binom{t+v}{v} \le 2^{t+v}$,
and $\beta^{2v}$, because $(t + v)/m \to 0$. Finally, $\binom{(m+s)^2}{t} \le (2m)^{2t}/t! < (4m^2/t)^t e^t$, where
$(4e)^t$ is nice; setting $t$ to its maximum value $(1 - \frac12\epsilon)m/\lambda(m)$, we have the upper bound
$(m^2/t)^t = (m\lambda(m)/(1 - \frac12\epsilon))^t = 2^{m(1 - \epsilon/2)} \cdot f(m)$, where $f(m)$ is nice. Hence the entire
sum is less than $\alpha^m$ for large $m$, if $\alpha = 2^{1-\eta}$, $0 < \eta < \frac12\epsilon$.

19. (a) $M \cap N$, $M \cup N$, $M \uplus N$, respectively; see Eqs. 4.5.2-6, 4.5.2-7.

(b) $f(z)g(z)$, $\mathrm{lcm}(f(z), g(z))$, $\gcd(f(z), g(z))$. (For the same reasons as (a), because
the monic irreducible polynomials over the complex numbers are precisely the
polynomials $z - \xi$.)

(c) Commutative laws $A \uplus B = B \uplus A$, $A \cup B = B \cup A$, $A \cap B = B \cap A$. Associative
laws $A \uplus (B \uplus C) = (A \uplus B) \uplus C$, $A \cup (B \cup C) = (A \cup B) \cup C$, $A \cap (B \cap C) = (A \cap B) \cap C$.
Distributive laws $A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$, $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$,
$A \uplus (B \cup C) = (A \uplus B) \cup (A \uplus C)$, $A \uplus (B \cap C) = (A \uplus B) \cap (A \uplus C)$. Idempotent
laws $A \cup A = A$, $A \cap A = A$. Absorption laws $A \cup (A \cap B) = A$, $A \cap (A \cup B) = A$,
$A \cap (A \uplus B) = A$, $A \cup (A \uplus B) = A \uplus B$. Identity and zero laws $\emptyset \uplus A = A$, $\emptyset \cup A = A$,
$\emptyset \cap A = \emptyset$, where $\emptyset$ is the empty multiset. Counting law $A \uplus B = (A \cup B) \uplus (A \cap B)$.
Further properties analogous to those of sets come from the partial ordering defined
by the rule $A \subseteq B$ iff $A \cap B = A$ (iff $A \cup B = B$).
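Several of the laws in (c) can be spot-checked with Python's `collections.Counter`, whose operators `+`, `|`, and `&` add, maximize, and minimize multiplicities and therefore model $\uplus$, $\cup$, and $\cap$ for multisets with positive counts. The helper name below is hypothetical, and a check on sample data is of course no substitute for a proof.

```python
from collections import Counter

def multiset_laws_hold(a, b, c):
    """Check a few of the multiset identities of part (c) on concrete Counters."""
    return all([
        a + (b | c) == (a + b) | (a + c),     # sum distributes over union
        a + (b & c) == (a + b) & (a + c),     # sum distributes over intersection
        a | (b & c) == (a | b) & (a | c),     # union distributes over intersection
        a & (a + b) == a,                     # absorption
        a | (a + b) == a + b,                 # absorption
        a + b == (a | b) + (a & b),           # counting law
    ])

A = Counter("aab")
B = Counter("abbc")
C = Counter("bccd")
```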

Notes: Other common applications of multisets are zeros and poles of meromorphic 
functions, invariants of matrices in canonical form, invariants of finite Abelian groups, 
etc.; multisets can be useful in combinatorial counting arguments and in the development
of measure theory. The terminal strings of a noncircular context-free grammar
form a multiset that is a set if and only if the grammar is unambiguous. Although
multisets appear frequently in mathematics, they often must be treated rather clumsily
because there is currently no standard way to treat sets with repeated elements.
Several mathematicians have voiced their belief that the lack of adequate terminology 
and notation for this common concept has been a definite handicap to the development 
of mathematics. (A multiset is, of course, formally equivalent to a mapping from a 
set into the nonnegative integers, but this formal equivalence is of little or no practical 
value for creative mathematical reasoning.) The author has discussed this matter with 
many people in an attempt to find a good remedy. Some of the names suggested for 
the concept were list, bunch, bag, heap, sample, weighted set, collection; but these 
words either conflict with present terminology, have an improper connotation, or are 
too much of a mouthful to say and to write conveniently. It does not seem out of 
place to coin a new word for such an important concept, and “multiset” has been suggested
by N. G. de Bruijn. The notation “$A \uplus B$” has been selected by the author
to avoid conflict with existing notations and to stress the analogy with set union. It
would not be as desirable to use “$A + B$” for this purpose, since algebraists have found
that $A + B$ is a good notation for $\{\alpha + \beta \mid \alpha \in A$ and $\beta \in B\}$. If $A$ is a multiset
of nonnegative integers, let $G(z) = \sum_{n \in A} z^n$ be a generating function corresponding
to $A$. (Generating functions with nonnegative integer coefficients obviously correspond
one-to-one with multisets of nonnegative integers.) If $G(z)$ corresponds to $A$ and $H(z)$
to $B$, then $G(z) + H(z)$ corresponds to $A \uplus B$ and $G(z)H(z)$ corresponds to $A + B$.
If we form “Dirichlet” generating functions $g(z) = \sum_{n \in A} 1/n^z$, $h(z) = \sum_{n \in B} 1/n^z$,
then the product $g(z)h(z)$ corresponds to the multiset product $AB$.

20. Type 3: $(S_0, \ldots, S_r) = (M_{00}, \ldots, M_{r0}) = (\{0\}, \ldots, \{A\}, \{A-1, A\}, \{A-1, A, A\},
\{A-1, A-1, A, A, A\}, \ldots, \{A+C-3, A+C-3, A+C-2, A+C-2, A+C-2\})$.
Type 5: $(M_{00}, \ldots, M_{r0}) = (\{0\}, \ldots, \{A\}, \{A-1, A\}, \ldots, \{A+C-1, A+C\},
\{A+C-1, A+C-1, A+C\}, \{A+C+D-1, A+C+D-1, A+C+D\})$;
$(M_{01}, \ldots, M_{r1}) = (\emptyset, \ldots, \emptyset, \emptyset, \emptyset, \{A+C-2\}, \ldots, \{A+C+D-2\})$, $S_i =
M_{i0} \uplus M_{i1}$.

21. For example, let $u = 2^{8q+5}$, $x = (2^{(q+1)u} - 1)/(2^u - 1) = 2^{qu} + \cdots + 2^u + 1$,
$y = 2^{(q+1)u} + 1$. Then $xy = (2^{2(q+1)u} - 1)/(2^u - 1)$. If $n = 2^{4(q+1)u} + xy$, we
have $l(n) \le 4(q+1)u + q + 2$ by Theorem F, but $l^*(n) = 4(q+1)u + 2q + 2$ by
Theorem H.

22. Underline everything except the u — 1 insertions used in the calculation of x. 

23. Theorem G (everything underlined). 

24. Use the numbers $(B^{a_i} - 1)/(B - 1)$, $0 \le i \le r$, underlined when $a_i$ is underlined;
and $c_k B^{i-1}(B^{b_j} - 1)/(B - 1)$ for $0 \le j < t$, $0 < i \le b_{j+1} - b_j$, $1 \le k \le l^0(B)$,
underlined when $c_k$ is underlined, where $c_0$, $c_1$, ... is a minimum length $l^0$-chain for $B$.
To prove the second inequality, let $B = 2^m$ and use (3). (The second inequality is
rarely, if ever, an improvement on Theorem G.)

25. We may assume that $d_k = 1$. Use the rule R $A_{k-1} \ldots A_1$, where $A_j$ = “XR” if
$d_j = 1$, $A_j$ = “R” otherwise, and where “R” means take the square root, “X” means
multiply by $x$. For example, if $y = (.1101101)_2$, the rule is R R XR XR R XR XR.
(There exist binary square-root extraction algorithms suitable for computer hardware, 
requiring an execution time comparable to that of division; computers with such 
hardware could therefore calculate more general fractional powers using the technique 
in this exercise.) 
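The rule can be tried out numerically. The Python sketch below is an illustration only: it uses floating-point `math.sqrt` in place of a hardware square-root algorithm, and the function name and the digit-list argument are assumptions.

```python
import math

def fractional_power(x, digits):
    """Compute x**y for y = (.d_1 d_2 ... d_k)_2 with d_k = 1, using only X and R steps."""
    if not digits or digits[-1] != 1:
        raise ValueError("the last binary digit must be 1")
    value = math.sqrt(x)                 # the leading R accounts for d_k = 1
    for d in reversed(digits[:-1]):      # apply A_{k-1}, ..., A_1
        if d == 1:
            value *= x                   # X: multiply by x
        value = math.sqrt(value)         # R: take the square root
    return value
```

With `digits = [1, 1, 0, 1, 1, 0, 1]` the exponent is $109/128$, and the steps performed are exactly R R XR XR R XR XR.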

26. If we know the pair $(F_k, F_{k-1})$, then we have $(F_{k+1}, F_k) = (F_k + F_{k-1}, F_k)$ and
$(F_{2k}, F_{2k-1}) = (F_k^2 + 2F_kF_{k-1}, F_k^2 + F_{k-1}^2)$; so a binary method can be used to
calculate $(F_n, F_{n-1})$, using $O(\log n)$ arithmetic operations. Perhaps better is to use the
pair of values $(F_k, L_k)$, where $L_k = F_{k-1} + F_{k+1}$ (cf. Section 4.5.4); then we have
$(F_{k+1}, L_{k+1}) = (\frac12(F_k + L_k), \frac12(5F_k + L_k))$, $(F_{2k}, L_{2k}) = (F_kL_k, L_k^2 - 2(-1)^k)$.

For the general linear recurrence $x_n = a_1x_{n-1} + \cdots + a_dx_{n-d}$, we can compute
$x_n$ in $O(d^3 \log n)$ arithmetic operations by computing the $n$th power of an appropriate
$d \times d$ matrix. [This observation is due to J. C. P. Miller and D. J. S. Brown, Comp.
J. 9 (1966), 188-190.]
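The $(F_k, L_k)$ formulas lead directly to a short recursive routine. This Python sketch (the function name is hypothetical) processes the binary digits of $n$ from the most significant end through recursion, using one doubling per bit and one increment step per 1 bit.

```python
def fib_lucas(n):
    """Return (F_n, L_n) using the doubling and increment rules for the pair (F_k, L_k)."""
    if n == 0:
        return 0, 2                              # F_0 = 0, L_0 = 2
    k = n // 2
    f, l = fib_lucas(k)
    f, l = f * l, l * l - 2 * (-1) ** k          # (F_2k, L_2k)
    if n % 2 == 1:
        f, l = (f + l) // 2, (5 * f + l) // 2    # (F_{2k+1}, L_{2k+1}); both sums are even
    return f, l
```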

27. First form the $2^m - m - 1$ products $x_1^{e_1} \ldots x_m^{e_m}$ for all sequences of exponents such
that $0 \le e_j \le 1$ and $e_1 + \cdots + e_m \ge 2$. Let $n_j = (d_{j\lambda} \ldots d_{j1}d_{j0})_2$; to complete the
calculation, take $x_1^{d_{1\lambda}} \ldots x_m^{d_{m\lambda}}$, then square and multiply by $x_1^{d_{1i}} \ldots x_m^{d_{mi}}$, for $i = \lambda - 1$,
..., 1, 0. [Straus showed in AMM 71 (1964), 807-808, that $2\lambda(n)$ may be replaced by
$(1 + \epsilon)\lambda(n)$ for any $\epsilon > 0$, by generalizing this binary method to $2^k$-ary as in Theorem D.
See exercise 39 for later developments.]
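A minimal Python sketch of this simultaneous binary method follows. The names are hypothetical, the table is indexed by a bitmask of which $x_j$ take part in a product, and no attempt is made to skip the multiplication when a column of bits is all zero.

```python
def simultaneous_power(xs, ns):
    """Compute xs[0]**ns[0] * ... * xs[m-1]**ns[m-1] by the binary method with a product table."""
    m = len(xs)
    if max(ns) < 1:
        raise ValueError("at least one exponent must be positive")
    # table[mask] = product of xs[j] over the bits j set in mask.
    table = [1] * (1 << m)
    for mask in range(1, 1 << m):
        low = mask & -mask                        # lowest set bit of mask
        table[mask] = table[mask ^ low] * xs[low.bit_length() - 1]

    def column(i):
        """Bitmask of the i-th binary digits of all exponents."""
        return sum(((n >> i) & 1) << j for j, n in enumerate(ns))

    top = max(ns).bit_length() - 1
    result = table[column(top)]                   # leading digits
    for i in range(top - 1, -1, -1):
        result = result * result * table[column(i)]   # square, then multiply
    return result
```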

28. (a) $x \nabla y = x \lor y \lor (x + y)$, where “$\lor$” is logical “or”, cf. exercise 4.6.2-26; clearly
$\nu(x \nabla y) \le \nu(x \lor y) + \nu(x \land y) = \nu(x) + \nu(y)$. (b) Note first that $A_{i-1}/2^{d_{i-1}} \subseteq
A_i/2^{d_i}$ for $1 \le i \le r$. Secondly, note that $d_j = d_{i-1}$ in a nondoubling; for otherwise
$a_{i-1} \ge 2a_j \ge a_j + a_k = a_i$. Hence $A_j \subseteq A_{i-1}$ and $A_k \subseteq A_{i-1}/2^{d_j - d_k}$. (c) An
easy induction on $i$, except that close steps need closer attention. Let us say that
$m$ has property $P(\alpha)$ if the 1’s in its binary representation all appear in consecutive
blocks of $\ge \alpha$ in a row. If $m$ and $m'$ have $P(\alpha)$, so does $m \nabla m'$; if $m$ has $P(\alpha)$ then
$\rho(m)$ has $P(\alpha + \delta)$. Hence $B_i$ has $P(1 + \delta c_i)$. Finally if $m$ has $P(\alpha)$ then $\nu(\rho(m)) \le
(\alpha + \delta)\nu(m)/\alpha$; for $\nu(m) = \nu_1 + \cdots + \nu_q$, where each block size $\nu_j$ is $\ge \alpha$, hence
$\nu(\rho(m)) \le (\nu_1 + \delta) + \cdots + (\nu_q + \delta) \le (1 + \delta/\alpha)\nu_1 + \cdots + (1 + \delta/\alpha)\nu_q$. (d) Let
$f = b_r + c_r$ be the number of nondoublings and $s$ the number of small steps. If
$f \ge 3.271 \lg \nu(n)$ we have $s \ge \lg \nu(n)$ as desired, by (16). Otherwise we have $a_i \le
(1 + 2^{-\delta})^{b_i} 2^{c_i + d_i}$ for $0 \le i \le r$, hence $n \le ((1 + 2^{-\delta})/2)^{b_r} 2^r$, and $r \ge \lg n + b_r -
b_r \lg(1 + 2^{-\delta}) \ge \lg n + \lg \nu(n) - \lg(1 + \delta c_r) - b_r \lg(1 + 2^{-\delta})$. Let $\delta = \lceil \lg(f + 1) \rceil$;
then $\ln(1 + 2^{-\delta}) \le \ln(1 + 1/(f + 1)) \le 1/(f + 1) \le \delta/(1 + \delta f)$, and it follows
that $\lg(1 + \delta x) + (f - x)\lg(1 + 2^{-\delta}) \le \lg(1 + \delta f)$ for $0 \le x \le f$. Hence finally
$l(n) \ge \lg n + \lg \nu(n) - \lg(1 + (3.271 \lg \nu(n))\lceil \lg(1 + 3.271 \lg \nu(n)) \rceil)$. [Theoretical Comp.
Sci. 1 (1975), 1-12.]

29. In the paper just cited, Schönhage refined the method of exercise 28 to prove that
$l(n) \ge \lg n + \lg \nu(n) - 2.13$ for all $n$. Can the remaining gap be closed?

30. $n = 31$ is the smallest example; $l(31) = 7$, but 1, 2, 4, 8, 16, 32, 31 is an
addition-subtraction chain of length 6. [After proving Theorem E, Erdős stated that
the same result holds also for addition-subtraction chains. Schönhage has extended
the lower bound of exercise 28 to addition-subtraction chains, with $\nu(n)$ replaced by
$\bar\nu(n)$ = minimum number of nonzero digits to represent $n = (n_q \ldots n_0)_2$ where each
$n_j$ is $-1$, 0, or $+1$. This quantity $\bar\nu(n)$ is the number of 1’s, in the ordinary binary
representation of $n$, that are immediately preceded by 0 or by the string $00(10)^k1$ for
some $k \ge 0$.]
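The quantity $\bar\nu(n)$ can be computed without enumerating representations. The Python sketch below relies on a standard fact that is not stated in the text and is assumed here: the nonadjacent form, in which no two neighbouring signed digits are both nonzero, attains the minimum number of nonzero digits. The function name is hypothetical.

```python
def signed_digit_weight(n):
    """Minimum number of nonzero digits in a representation of n >= 0 with digits -1, 0, +1."""
    weight = 0
    while n > 0:
        if n & 1:
            digit = 2 - (n & 3)    # +1 if n = 1 (mod 4), -1 if n = 3 (mod 4)
            n -= digit             # clears the low bit; the next digit is then forced to 0
            weight += 1
        n >>= 1
    return weight
```

For $n = 31$ the result is 2, corresponding to $31 = 2^5 - 1$, whereas $\nu(31) = 5$.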

32. First compute $2^i$ for $1 \le i \le \lambda(n_m)$, then compute each $n = n_j$ by the following
variant of the $2^k$-ary method: For all odd $i < 2^k$, compute $f_i = \sum\{2^{kt+e} \mid d_t = 2^e i\}$
where $n = (\ldots d_1d_0)_{2^k}$, in at most $\lfloor \frac1k \lg n \rfloor$ steps; then compute $n = \sum i f_i$ in at most
$\sum l(i) + 2^{k-1}$ further steps. The number of steps per $n_j$ is $\le \lfloor \frac1k \lg n \rfloor + O(k2^k)$, and
this is $\lambda(n)/\lambda\lambda(n) + O(\lambda(n)\lambda\lambda\lambda(n)/\lambda\lambda(n)^2)$ when $k = \lfloor \lg\lg n - 3\lg\lg\lg n \rfloor$.

[A generalization of Theorem E gives the corresponding lower bound. Reference:
SIAM J. Computing 5 (1976), 100-103. See also exercise 39.]

33. The following construction due to D. J. Newman provides the best upper bound
currently known: Let $k = p_1 \ldots p_r$ be the product of the first $r$ primes. Compute $k$
and all quadratic residues mod $k$ by the method of exercise 32, in $O(2^{-r}k\log k)$ steps
(because there are approximately $2^{-r}k$ quadratic residues). Also compute all multiples
of $k$ that are $\le m^2$, in about $m^2/k$ further steps. Now $m$ additions suffice to compute
$1^2$, $2^2$, ..., $m^2$. We have $k = \exp(p_r + O(p_r/(\log p_r)^{1000}))$ where $p_r$ is given by the
formula in exercise 4.5.4-35; so by choosing

$r = \lfloor (1 + (1 + \frac12\ln 2)/\lg\lg m) \ln m/\ln\ln m \rfloor$

it follows that $l(1^2, \ldots, m^2) = m + O(m \cdot \exp(-(\frac12\ln 2) \ln m/\ln\ln m))$.

On the other hand, D. Dobkin and R. Lipton have shown that, for any $\epsilon > 0$,
$l(1^2, \ldots, m^2) > m + m^{2/3-\epsilon}$ when $m$ is sufficiently large [SIAM J. Computing 9 (1980),
121-125].

35. See Discrete Math. 23 (1978), 115-119. 

36. Eight; there are four ways to compute $39 = 12 + 12 + 12 + 3$ and two ways to
compute $79 = 39 + 39 + 1$.
37. The statement is true. The labels in the reduced graph of the binary chain are
$\lfloor n/2^k \rfloor$ for $k = e_0$, ..., 0; they are 1, 2, ..., $2^{e_0}$, $n$ in the dual graph. [Similarly, the
right-to-left $m$-ary method of exercise 9 is the dual of the left-to-right method.]

38. $2^t$ are equivalent to the binary chain; it would be $2^{t-1}$ if $e_0 = e_1 + 1$. The number
of chains equivalent to the scheme of Algorithm A is the number of ways to compute
the sum of $t + 2$ numbers of which two are identical. This is $\frac12 f_{t+1} + \frac12 f_t$, where $f_m$
is the number of ways to compute the sum of $m + 1$ distinct numbers. When we take
commutativity into account, we see that $f_m$ is $2^{-m}$ times $(m + 1)!$ times the number
of binary trees on $m$ nodes, so $f_m = (2m - 1)(2m - 3) \ldots 1$.

39. The quantity $l([n_1, n_2, \ldots, n_m])$ is the minimum of arcs $-$ vertices $+ m$ taken
over all directed graphs having $m$ vertices $s_j$ whose in-degree is zero and one vertex $t$
whose out-degree is zero, where there are exactly $n_j$ oriented paths from $s_j$ to $t$ for
$1 \le j \le m$. The quantity $l(n_1, n_2, \ldots, n_m)$ is the minimum of arcs $-$ vertices $+ 1$ taken
over all directed graphs having one vertex $s$ whose in-degree is zero and $m$ vertices $t_j$
whose out-degree is zero, where there are exactly $n_j$ oriented paths from $s$ to $t_j$ for
$1 \le j \le m$. These problems are dual to each other, if we change the direction of all
the arcs.

Note: C. H. Papadimitriou points out that this is a special case of a much more
general theorem. Let $N = (n_{ij})$ be an $m \times p$ matrix of nonnegative integers having
no row or column entirely zero. We can define $l(N)$ to be the minimum number of
multiplications needed to compute the set of monomials $\{x_1^{n_{1j}} \ldots x_m^{n_{mj}} \mid 1 \le j \le p\}$.
Now $l(N)$ is also the minimum of arcs $-$ vertices $+ m$ taken over all directed graphs
having $m$ vertices $s_i$ whose in-degree is zero and $p$ vertices $t_j$ whose out-degree is zero,
where there are exactly $n_{ij}$ oriented paths from $s_i$ to $t_j$ for each $i$ and $j$. By duality
we have $l(N) = l(N^T) + m - p$.

N. Pippenger has proved a comprehensive generalization of the results of exercises
27 and 32. Let $L(m, p, n)$ be the maximum of $l(N)$ taken over all $m \times p$ matrices $N$
of nonnegative integers $n_{ij} \le n$. Then $L(m, p, n) = \min(m, p)\lg n + H/\lg H +
O(m + p + H(\log\log H)^{1/2}(\log H)^{-3/2})$, where $H = mp\lg(n + 1)$. [SIAM J. Computing
9 (1980), 230-250.]

40. By exercise 39, it suffices to show that $l(m_1n_1 + \cdots + m_tn_t) \le l(m_1, \ldots, m_t) +
l([n_1, \ldots, n_t])$. But this is clear, since we can first form $\{x^{m_1}, \ldots, x^{m_t}\}$ and then
compute the monomial $(x^{m_1})^{n_1} \ldots (x^{m_t})^{n_t}$.

Notes: Theorem F is a special case of this result, since we clearly have $l(2^m, x) \le
m + \nu(x) - 1$ when $\lambda(x) \le m$. One strong way to state Olivos’s theorem is that if $a_0$,
..., $a_r$ and $b_0$, ..., $b_s$ are any addition chains, then $l(\sum c_{ij}a_ib_j) \le r + s + \sum c_{ij} - 1$
for any $(r + 1) \times (s + 1)$ matrix of nonnegative integers $c_{ij}$.

41. [To appear.] The stated formula can be proved whenever $A \ge 9m^2$. Since this
is a polynomial in $m$, and since the problem of finding a minimum vertex cover is
NP hard (cf. Section 7.9), the problem of computing $l(n_1, \ldots, n_m)$ is NP complete. [It
is unknown whether or not the problem of computing $l(n)$ is NP complete.]


SECTION 4.6.4 

1. Set $y \gets x^2$, then compute $((\ldots(u_{2n+1}y + u_{2n-1})y + \cdots)y + u_1)x$.
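As a concrete illustration, here is a small Python sketch of this scheme; the function name and the convention of passing only the odd-indexed coefficients $u_1, u_3, \ldots, u_{2n+1}$ are assumptions made for the example.

```python
def eval_odd_polynomial(odd_coeffs, x):
    """Evaluate u_1*x + u_3*x**3 + ... + u_{2n+1}*x**(2n+1); odd_coeffs = [u_1, u_3, ...]."""
    y = x * x                          # y <- x^2
    acc = 0
    for c in reversed(odd_coeffs):     # Horner's rule in the variable y
        acc = acc * y + c
    return acc * x                     # one final multiplication by x
```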
2. Replacing $x$ in (2) by the polynomial $x + x_0$ leads to the following procedure:

G1. Do step G2 for $k = n$, $n - 1$, ..., 0 (in this order), and stop.

G2. Set $v_k \gets u_k$, and then set $v_j \gets v_j + x_0v_{j+1}$ for $j = k$, $k + 1$, ..., $n - 1$.
(When $k = n$, this step simply sets $v_n \gets u_n$.) ∎

The computations turn out to be identical to those in H1 and H2, but performed in a
different order. (This application was, in fact, Newton’s original motivation for using
scheme (2).)
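Steps G1 and G2 can be written out in a few lines of Python. In this sketch (the function name is hypothetical) the list `u` holds $u_0, \ldots, u_n$ and the result holds the coefficients $v_0, \ldots, v_n$ of $u(x + x_0)$.

```python
def shift_polynomial(u, x0):
    """Return the coefficients of u(x + x0), given u = [u_0, u_1, ..., u_n]."""
    n = len(u) - 1
    v = [0] * (n + 1)
    for k in range(n, -1, -1):         # G1: k = n, n-1, ..., 0
        v[k] = u[k]                    # G2: v_k <- u_k
        for j in range(k, n):          # then j = k, k+1, ..., n-1
            v[j] += x0 * v[j + 1]
    return v
```

For example, $1 + 2x + 3x^2$ shifted by $x_0 = 1$ becomes $6 + 8x + 3x^2$.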

3. The coefficient of $x^k$ is a polynomial in $y$ that may be evaluated by Horner’s rule:

$(\ldots(u_{n,0}x + (u_{n-1,1}y + u_{n-1,0}))x + \cdots)x + ((\ldots(u_{0,n}y + u_{0,n-1})y + \cdots)y + u_{0,0})$. [For
a “homogeneous” polynomial, such as $u_nx^n + u_{n-1}x^{n-1}y + \cdots + u_1xy^{n-1} + u_0y^n$,
another scheme is more efficient: if $0 < |x| \le |y|$, first divide $x$ by $y$, evaluate a
polynomial in $x/y$, then multiply by $y^n$.]

4. Rule (2) involves $4n$ or $3n$ real multiplications and $4n$ or $7n$ real additions; (3) is
worse, it takes $4n + 2$ or $4n + 1$ mults, $4n + 2$ or $4n + 5$ adds.

5. One multiplication to compute $x^2$; $\lfloor n/2 \rfloor$ multiplications and $\lfloor n/2 \rfloor$ additions to
evaluate the first line; $\lceil n/2 \rceil$ multiplications and $\lceil n/2 \rceil - 1$ additions to evaluate the
second line; and one addition to add the two lines together. Total: $n + 1$ multiplications
and $n$ additions.

6. J1. Compute and store the values $x_0^2$, ..., $x_0^{\lceil n/2 \rceil}$.

J2. Set $v_j \gets u_j x_0^{j - \lfloor n/2 \rfloor}$ for $0 \le j \le n$.

J3. For $k = 0$, 1, ..., $n - 1$, set $v_j \gets v_j + v_{j+1}$ for $j = n - 1$, ..., $k + 1$, $k$.

J4. Set $v_j \gets v_j x_0^{\lfloor n/2 \rfloor - j}$ for $0 \le j \le n$. ∎

There are $(n^2 + n)/2$ additions, $n + \lceil n/2 \rceil - 1$ multiplications, $n$ divisions. Another
multiplication and division can be saved by treating $v_n$ and $v_0$ as special cases. Reference:
SIGACT News 7, 3 (Summer 1975), 32-34.

7. Let $x_j = x_0 + jh$, and consider (42), (44). Set $y_j \gets u(x_j)$ for $0 \le j \le n$. For
$k = 1$, 2, ..., $n$ (in this order), set $y_j \gets y_j - y_{j-1}$ for $j = n$, $n - 1$, ..., $k$ (in this
order). Now $\beta_j = y_j$ for all $j$.
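A minimal Python sketch of this in-place differencing follows; the function name is hypothetical. The inner index runs downward so that each $y_{j-1}$ used in a subtraction still holds its value from the previous pass.

```python
def forward_differences(u, x0, h, n):
    """Return [u(x0), Δu(x0), ..., Δ^n u(x0)] for step size h, computed in place."""
    y = [u(x0 + j * h) for j in range(n + 1)]
    for k in range(1, n + 1):
        for j in range(n, k - 1, -1):      # j = n, n-1, ..., k
            y[j] -= y[j - 1]
    return y
```

For $u(x) = x^3$ with $x_0 = 0$, $h = 1$, $n = 3$, the values 0, 1, 8, 27 become the differences 0, 1, 6, 6.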

8. See (43).

9. [Combinatorial Mathematics (Buffalo: Math. Assoc. of America, 1963), 26-28.]
This formula can be regarded as an application of the principle of inclusion and
exclusion (Section 1.3.3), since the sum of the terms for $n - \epsilon_1 - \cdots - \epsilon_n = k$ is
the sum of all $x_{1j_1}x_{2j_2} \ldots x_{nj_n}$ for which $k$ values of the $j_i$ do not appear. A direct
proof can be given by observing that the coefficient of $x_{1j_1} \ldots x_{nj_n}$ is

$\sum (-1)^{n - \epsilon_1 - \cdots - \epsilon_n} \epsilon_{j_1} \ldots \epsilon_{j_n}$;

if the $j$’s are distinct, this equals unity, but if $j_1$, ..., $j_n \ne k$ then it is zero, since the
terms for $\epsilon_k = 0$ cancel the terms for $\epsilon_k = 1$.

To evaluate the sum efficiently, we can start with $\epsilon_1 = 1$, $\epsilon_2 = \cdots = \epsilon_n = 0$, and
we can then proceed through all combinations of the $\epsilon$’s in such a way that only one
$\epsilon$ changes from one term to the next. (See “Gray code” in Chapter 7.) The work to
compute the first term is $n - 1$ multiplications; the subsequent $2^n - 2$ terms each involve
$n$ additions, then $n - 1$ multiplications, then one more addition. Total: $(2^n - 1)(n - 1)$
multiplications, and $(2^n - 2)(n + 1)$ additions. Only $n + 1$ temp storage locations are
needed, one for the main partial sum and one for each factor of the current product.
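The Gray-code evaluation described above can be sketched in Python as follows. The function name and the use of the standard reflected Gray code `t ^ (t >> 1)` are assumptions; the text asks only for some ordering in which one $\epsilon$ changes at a time.

```python
def permanent_gray(x):
    """Permanent of the square matrix x by inclusion and exclusion, visiting subsets in Gray-code order."""
    n = len(x)
    row_sums = [0] * n                 # row_sums[i] = sum over j of eps_j * x[i][j]
    total = 0
    previous = 0
    for t in range(1, 2 ** n):
        gray = t ^ (t >> 1)            # the nonempty eps vectors, one bit changing per step
        changed = gray ^ previous      # exactly one bit is set here
        j = changed.bit_length() - 1
        sign = 1 if gray & changed else -1     # eps_j switched on (+) or off (-)
        for i in range(n):
            row_sums[i] += sign * x[i][j]      # n additions
        product = 1
        for s in row_sums:
            product *= s                       # n - 1 multiplications, apart from the initial 1
        zeros = n - bin(gray).count("1")       # n - eps_1 - ... - eps_n
        total += product if zeros % 2 == 0 else -product
        previous = gray
    return total
```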
10. $\sum_{1 \le k < n} (k + 1)\binom{n}{k+1} = n(2^{n-1} - 1)$ multiplications and $\sum_{1 \le k < n} k\binom{n}{k+1} =
n2^{n-1} - 2^n + 1$ additions. This is approximately half as many arithmetic operations
as the method of exercise 9, although it requires a more complicated program to control
the sequence. Approximately $\binom{n}{\lceil n/2 \rceil} + \binom{n}{\lceil n/2 \rceil - 1}$ temporary storage locations must be
used, and this grows exponentially large (on the order of $2^n/\sqrt{n}$).

The method in this exercise is equivalent to the unusual matrix factorization of 
the permanent function given by Jurkat and Ryser in J. Algebra 5 (1967), 342-357. It 
may also be regarded as an application of (39) and (40), in an appropriate sense. 

12. Here is a brief summary of progress on this famous research problem: J. Hopcroft
and L. R. Kerr proved, among other things, that seven multiplications are necessary
in $2 \times 2$ matrix multiplication [SIAM J. Appl. Math. 20 (1971), 30-36]. R. L. Probert
showed that all 7-multiplication schemes, in which each multiplication takes a linear
combination of elements from one matrix and multiplies by a linear combination of
elements from the other, must have at least 15 additions [SIAM J. Computing 5 (1976),
187-203]. For $n = 3$, the best method known is due to J. D. Laderman [Bull. Amer.
Math. Soc. 82 (1976), 126-128], who showed that 23 noncommutative multiplications
suffice. His construction has been generalized by Ondrej Sýkora, who exhibited a
method requiring $n^3 - (n - 1)^2$ noncommutative multiplications and $n^3 - n^2 + 11(n - 1)^2$
additions, a result that also reduces to (36) when $n = 2$ [Lecture Notes in Comp. Sci.
53 (1977), 504-512]. The best lower bound known to hold for all $n$ is the fact that
$2n^2 - 1$ nonscalar multiplications are necessary [Jean-Claude Lafon and S. Winograd,
Theoretical Comp. Sci., to appear]. Pan has generalized this to a lower bound of
$mn + ns + m - n - 1$ in the $m \times n \times s$ case [see SIAM J. Computing 9 (1980), 341].
The best upper bounds known for large $n$ were changing rapidly as this book was being
prepared for publication; see exercises 60-64.

13. By summing geometric series, we find that $F(t_1, \ldots, t_n)$ equals

$$\sum_{0 \le s_1 < m_1,\ \ldots,\ 0 \le s_n < m_n} \exp\bigl(-2\pi i (s_1 t_1/m_1 + \cdots + s_n t_n/m_n)\bigr)\, f(s_1, \ldots, s_n) \,/\, m_1 \ldots m_n.$$

The inverse transform times $m_1 \ldots m_n$ can be found by doing a regular transform and
interchanging $t_j$ with $m_j - t_j$ when $t_j \ne 0$; cf. exercise 4.3.3-9.

[If we regard $F(t_1, \ldots, t_n)$ as the coefficient of $x_1^{t_1} \ldots x_n^{t_n}$ in a multivariate polynomial,
the finite Fourier transform amounts to evaluation of this polynomial at roots of
unity, and the inverse transform amounts to finding the interpolating polynomial.]
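The index-reversal trick can be illustrated in one dimension. This is a minimal Python sketch, assuming the convention $f(s) = \sum_t \omega^{st} F(t)$ with $\omega = e^{2\pi i/m}$; the function names are hypothetical, and the multidimensional case would apply the same reversal in every coordinate.

```python
import cmath

def transform(F, m):
    """Regular transform: f(s) = sum over t of omega^(s t) F(t), omega = exp(2 pi i / m)."""
    return [sum(cmath.exp(2j * cmath.pi * s * t / m) * F[t] for t in range(m))
            for s in range(m)]

def inverse_via_forward(f, m):
    """Recover F from f with one regular transform.

    The regular transform of f yields m * F at the reversed index, so we
    read position (m - t) mod m (index 0 stays put) and divide by m.
    """
    g = transform(f, m)
    return [g[(m - t) % m] / m for t in range(m)]
```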

14. Let $m_1 = \cdots = m_n = 2$, $F(t_1, t_2, \ldots, t_n) = F(2^{n-1}t_n + \cdots + 2t_2 + t_1)$, and
$f(s_1, s_2, \ldots, s_n) = f(2^{n-1}s_1 + 2^{n-2}s_2 + \cdots + s_n)$; note the reversed treatment between
t's and s's. Also let $g_k(s_k, \ldots, s_n, t_k)$ be $\omega$ raised to the $2^{k-1}t_k(s_n + 2s_{n-1} + \cdots + 2^{n-k}s_k)$ power.

At each iteration we essentially take $2^{n-1}$ pairs of complex numbers $(\alpha, \beta)$ and
replace them by $(\alpha + \zeta\beta,\ \alpha - \zeta\beta)$, where $\zeta$ is a suitable power of $\omega$, hence $\zeta = \cos\theta + i \sin\theta$
for some $\theta$. If we take advantage of simplifications when $\zeta = \pm 1$ or $\pm i$, the
total work comes to $((n-3) \cdot 2^{n-1} + 2)$ complex multiplications and $n \cdot 2^n$ complex
additions; the techniques of exercise 41 can be used to reduce the real multiplications
and additions used to implement these complex operations.

The number of complex multiplications can be reduced about 25 per cent without
changing the number of additions by combining passes k and k + 1 for k = 1, 3, ...;
this means that $2^{n-2}$ quadruples $(\alpha, \beta, \gamma, \delta)$ are being replaced by
$(\alpha + \zeta\beta + \zeta^2\gamma + \zeta^3\delta,\ \alpha + i\zeta\beta - \zeta^2\gamma - i\zeta^3\delta,\ \alpha - \zeta\beta + \zeta^2\gamma - \zeta^3\delta,\ \alpha - i\zeta\beta - \zeta^2\gamma + i\zeta^3\delta)$.
The number of complex multiplications when n is even is thereby reduced to
$(3n - 2)2^{n-3} - 5\lfloor 2^{n-1}/3 \rfloor$.
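The operation counts for the pair-at-a-time method can be confirmed by instrumenting a small implementation. The sketch below is an illustrative radix-2 transform, not the book's exact organization of the passes: it assumes $\omega = e^{2\pi i/2^n}$, uses bit-reversed input order, and counts a multiplication only when $\zeta$ is not one of $\pm 1$, $\pm i$. The name `fft_with_counts` is hypothetical.

```python
import cmath

def fft_with_counts(F):
    """Radix-2 transform f(s) = sum_t omega^(s t) F(t) for len(F) = 2^n, n >= 1.

    Returns (f, mults, adds), where mults counts only multiplications by
    zeta not in {1, -1, i, -i} and adds counts complex additions.
    """
    N = len(F)
    n = N.bit_length() - 1
    assert N == 1 << n and n >= 1, "length must be a power of two, at least 2"
    # Place the inputs in bit-reversed order.
    a = [F[int(format(t, "0%db" % n)[::-1], 2)] for t in range(N)]
    mults = adds = 0
    half = 1
    while half < N:
        step = N // (2 * half)
        for start in range(0, N, 2 * half):
            for j in range(half):
                e = j * step                      # zeta = omega^e, 0 <= e < N/2
                zeta = cmath.exp(2j * cmath.pi * e / N)
                if (4 * e) % N != 0:              # zeta is not 1 or i here
                    mults += 1
                alpha = a[start + j]
                zb = zeta * a[start + j + half]
                a[start + j] = alpha + zb         # (alpha + zeta*beta)
                a[start + j + half] = alpha - zb  # (alpha - zeta*beta)
                adds += 2
        half *= 2
    return a, mults, adds
```

With n = 10 the count is 3586 complex multiplications and 10240 complex additions.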
These calculations assume that the given numbers F(t) are complex. If the F(t) are
real, then f(s) is the complex conjugate of $f(2^n - s)$, so we can avoid the redundancy
by computing only the $2^n$ independent real numbers $f(0)$, $\Re f(1)$, ..., $\Re f(2^{n-1} - 1)$,
$f(2^{n-1})$, $\Im f(1)$, ..., $\Im f(2^{n-1} - 1)$. The entire calculation in this case can be done
by working with $2^n$ real values, using the fact that $f^{[k]}(s_{n-k+1}, \ldots, s_n, t_1, \ldots, t_{n-k})$
will be the complex conjugate of $f^{[k]}(s'_{n-k+1}, \ldots, s'_n, t_1, \ldots, t_{n-k})$ when
$(s_1 \ldots s_n)_2 + (s'_1 \ldots s'_n)_2 \equiv 0$ (modulo $2^n$). About half as many multiplications and additions are
needed as in the complex case.

[The fast Fourier transform algorithm is essentially due to C. Runge and H. König
in 1924, and it was generalized by J. W. Cooley and J. W. Tukey, Math. Comp. 19
(1965), 297-301. Its interesting history has been traced by J. W. Cooley, P. A. W.
Lewis, and P. D. Welch, Proc. IEEE 55 (1967), 1675-1677. Details concerning its use
have been discussed by R. C. Singleton, CACM 10 (1967), 647-654; M. C. Pease, JACM
15 (1968), 252-264; G. D. Berglund, Math. Comp. 22 (1968), 275-279, CACM 11 (1968),
703-710; A. M. Macnaghten and C. A. R. Hoare, Comp. J. 20 (1977), 78-83. See also
exercises 53, 57, and 59.]

15. (a) The hint follows by integration and induction. Let $f^{(n)}(\theta)$ take on all values
between A and B inclusive, as $\theta$ varies from $\min(x_0, \ldots, x_n)$ to $\max(x_0, \ldots, x_n)$. Replacing
$f^{(n)}$ by each of these bounds, in the stated integral, yields $A/n! \le f(x_0, \ldots, x_n) \le B/n!$.
(b) It suffices to prove this for j = n. Let f be Newton's interpolation polynomial;
then $f^{(n)}$ is the constant $n!\,\alpha_n$.

16. Carry out the multiplications and additions of (43) as operations on polynomials.
(The special case $x_0 = x_1 = \cdots = x_n$ is considered in exercise 2. We have used this
method in step C8 of Algorithm 4.3.3C.)

17. T. M. Vari has shown that n - 1 multiplications are necessary, by proving that
n multiplications are necessary to compute $x_1^2 + \cdots + x_n^2$ [Cornell Computer Science
Report 120 (Jan. 1972)].

18. $\alpha_0 = \frac12(u_3/u_4 + 1)$, $\beta = u_2/u_4 - \alpha_0(\alpha_0 - 1)$, $\alpha_1 = \alpha_0\beta - u_1/u_4$, $\alpha_2 = \beta - 2\alpha_1$,
$\alpha_3 = u_0/u_4 - \alpha_1(\alpha_1 + \alpha_2)$, $\alpha_4 = u_4$.
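These formulas can be checked with exact rational arithmetic. The Python sketch below assumes, as an inference not stated in this answer, that the scheme being adapted is $y = (x + \alpha_0)x + \alpha_1$, $u(x) = ((y - x + \alpha_2)y + \alpha_3)\alpha_4$. Expanding that form and matching coefficients gives exactly the formulas above. The function names are hypothetical.

```python
from fractions import Fraction

def adapt_quartic(u):
    """Map coefficients u = [u0, u1, u2, u3, u4] (u4 != 0) to (a0, a1, a2, a3, a4)."""
    u0, u1, u2, u3, u4 = (Fraction(c) for c in u)
    a0 = (u3 / u4 + 1) / 2
    beta = u2 / u4 - a0 * (a0 - 1)
    a1 = a0 * beta - u1 / u4
    a2 = beta - 2 * a1
    a3 = u0 / u4 - a1 * (a1 + a2)
    return a0, a1, a2, a3, u4

def evaluate_adapted(alphas, x):
    """Assumed scheme: y = (x + a0) x + a1; u(x) = ((y - x + a2) y + a3) a4."""
    a0, a1, a2, a3, a4 = alphas
    y = (x + a0) * x + a1
    return ((y - x + a2) * y + a3) * a4

def horner(u, x):
    """Reference evaluation of u0 + u1 x + ... + u4 x^4."""
    result = Fraction(0)
    for c in reversed(u):
        result = result * x + c
    return result
```

The adapted form uses three multiplications and five additions per evaluation, once the $\alpha$'s have been computed.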

19. Since $\alpha_5$ is the leading coefficient, we may assume without loss of generality that
u(x) is monic (i.e., $u_5 = 1$). Then $\alpha_0$ is a root of the cubic equation
$40z^3 - 24u_4z^2 + (4u_4^2 + 2u_3)z + (u_2 - u_3u_4) = 0$; this equation always has at least one real root, and it
may have three. Once $\alpha_0$ is determined, we have $\alpha_3 = u_4 - 4\alpha_0$, $\alpha_1 = u_3 - 4\alpha_0\alpha_3 - 6\alpha_0^2$,
$\alpha_2 = u_1 - \alpha_0(\alpha_0\alpha_1 + 4\alpha_0^2\alpha_3 + 2\alpha_1\alpha_3 + \alpha_0^3)$, $\alpha_4 = u_0 - \alpha_3(\alpha_0^4 + \alpha_1\alpha_0^2 + \alpha_2)$.

For the given polynomial we are to solve the cubic equation $40z^3 - 120z^2 + 80z = 0$;
this leads to three solutions $(\alpha_0, \alpha_1, \alpha_2, \alpha_3, \alpha_4, \alpha_5)$ = (0, -10, 13, 5, -5, 1),
(1, -20, 68, 1, 11, 1), (2, -10, 13, -3, 27, 1).
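The three solutions can be reproduced directly from the formulas. In this Python sketch the evaluation scheme $u(x) = ((((x+\alpha_0)^2 + \alpha_1)(x+\alpha_0)^2 + \alpha_2)(x+\alpha_3) + \alpha_4)\alpha_5$ is read off from the program in answer 20. The coefficient list `U` is not quoted from the text. It is reconstructed by expanding the first listed solution, so treat it as an assumption. Function names are hypothetical.

```python
from fractions import Fraction

# Assumed monic quintic, constant term first: x^5 + 5x^4 - 10x^3 - 50x^2 + 13x + 60.
U = [60, 13, -50, -10, 5, 1]

def adapt_quintic(u, a0):
    """Given monic u = [u0, ..., u4, 1] and a root a0 of the cubic, return (a0, ..., a5)."""
    u0, u1, u2, u3, u4 = (Fraction(c) for c in u[:5])
    a0 = Fraction(a0)
    cubic = 40 * a0**3 - 24 * u4 * a0**2 + (4 * u4**2 + 2 * u3) * a0 + (u2 - u3 * u4)
    if cubic != 0:
        raise ValueError("a0 is not a root of the cubic equation")
    a3 = u4 - 4 * a0
    a1 = u3 - 4 * a0 * a3 - 6 * a0**2
    a2 = u1 - a0 * (a0 * a1 + 4 * a0**2 * a3 + 2 * a1 * a3 + a0**3)
    a4 = u0 - a3 * (a0**4 + a1 * a0**2 + a2)
    return a0, a1, a2, a3, a4, Fraction(1)

def evaluate_adapted(alphas, x):
    """Scheme matching the program of answer 20."""
    a0, a1, a2, a3, a4, a5 = alphas
    w = (x + a0) ** 2
    return (((w + a1) * w + a2) * (x + a3) + a4) * a5

def horner(u, x):
    result = 0
    for c in reversed(u):
        result = result * x + c
    return result
```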


20.    LDA   X
       FADD  =α3=
       STA   TEMP1
       FADD  =α0−α3=
       STA   TEMP2
       FMUL  TEMP2
       STA   TEMP2
       FADD  =α1=
       FMUL  TEMP2
       FADD  =α2=
       FMUL  TEMP1
       FADD  =α4=
       FMUL  =α5=   ∎


21. $z = (x + 1)x - 2$, $w = (x + 5)z + 9$, $u(x) = (w + z - 8)w - 8$; or $z = (x + 9)x + 26$,
$w = (x - 3)z + 73$, $u(x) = (w + z - 24)w - 12$.
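Both adapted forms should describe the same sixth-degree polynomial, which is easy to confirm numerically. In the Python sketch below the coefficient list used by `horner` was obtained by expanding the first form by hand; it is not quoted from the text, so treat it as an assumption. The function names are hypothetical.

```python
def scheme_one(x):
    """First adapted form from the answer."""
    z = (x + 1) * x - 2
    w = (x + 5) * z + 9
    return (w + z - 8) * w - 8

def scheme_two(x):
    """Second adapted form from the answer."""
    z = (x + 9) * x + 26
    w = (x - 3) * z + 73
    return (w + z - 24) * w - 12

def horner(x):
    """x^6 + 13x^5 + 49x^4 + 33x^3 - 61x^2 - 37x + 3 (expanded by hand)."""
    result = 0
    for c in (1, 13, 49, 33, -61, -37, 3):
        result = result * x + c
    return result
```

Agreement at seven or more points is enough to show that two polynomials of degree six are identical.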
22. $\alpha_6 = 1$, $\alpha_0 = -1$, $\alpha_1 = 1$, $\beta_1 = -2$, $\beta_2 = -2$, $\beta_3 = -2$, $\beta_4 = 1$, $\alpha_3 = -4$,
$\alpha_2 = 0$, $\alpha_4 = 4$, $\alpha_5 = -2$. We form $z = (x - 1)x + 1$, $w = z + x$, and
$u(x) = ((z - x - 4)w + 4)z - 2$. In this special case we see that one of the seven additions
can be saved if we compute $w = x^2 + 1$, $z = w - x$.


23. (a) We may use induction on n; the result is trivial if n < 2. If f(0) = 0, then the
result is true for the polynomial f(z)/z, so it holds for f(z). If f(iy) = 0 for some real
$y \ne 0$, then $g(\pm iy) = h(\pm iy) = 0$; since the result is true for $f(z)/(z^2 + y^2)$, it holds
also for f(z). Therefore we may assume that f(z) has no roots whose real part is zero.
When R is large, the path $f(Re^{it})$ for
$\pi/2 \le t \le 3\pi/2$ will circle the origin clockwise approximately n/2 times; so the path
f(it) for $-R \le t \le R$ must go counterclockwise around the origin at least n/2 - 1
times. For n even, this implies that f(it) crosses the imaginary axis at least n - 2 times,
and the real axis at least n - 3 times; for n odd, f(it) crosses the real axis at least
n - 2 times and the imaginary axis at least n - 3 times. These are roots respectively
of g(it) = 0, h(it) = 0.

(b) If not, g or h would have a root of the form a + bi with $a \ne 0$ and $b \ne 0$.
But this would imply the existence of at least three other such roots, namely a - bi
and $-a \pm bi$, while g(z) and h(z) have at most n roots.


24. The roots of u are -7, $-3 \pm i$, $-2 \pm i$, and -1; permissible values of c are 2
and 4 (but not 3, since c = 3 makes the sum of the roots equal to zero). Case 1, c = 2:
$p(x) = (x + 5)(x^2 + 2x + 2)(x^2 + 1)(x - 1) = x^6 + 6x^5 + 6x^4 + 4x^3 - 5x^2 - 2x - 10$;
$q(x) = 6x^2 + 4x - 2 = 6(x + 1)(x - \frac13)$. Let $\alpha_2 = -1$, $\alpha_1 = \frac13$;
$p_1(x) = x^4 + 6x^3 + 5x^2 - 2x - 10 = (x^2 + 6x + \frac{16}{3})(x^2 - \frac13) - \frac{74}{9}$; $\alpha_0 = 6$, $\beta_0 = \frac{16}{3}$, $\beta_1 = -\frac{74}{9}$.
Case 2, c = 4: A similar analysis gives $\alpha_2 = 9$, $\alpha_1 = -3$, $\alpha_0 = -6$, $\beta_0 = 12$, $\beta_1 = -26$.

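25. $\beta_1 = \alpha_2$, $\beta_2 = 2\alpha_1$, $\beta_3 = \alpha_7$, $\beta_4 = \alpha_6$, $\beta_5 = \beta_6 = 0$, $\beta_7 = \alpha_1$, $\beta_8 = 0$,
$\beta_9 = 2\alpha_1 - \alpha_8$.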

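26. (a) $\lambda_1 = \alpha_1 \times \lambda_0$, $\lambda_2 = \alpha_2 + \lambda_1$, $\lambda_3 = \lambda_2 \times \lambda_0$, $\lambda_4 = \alpha_3 + \lambda_3$, $\lambda_5 = \lambda_4 \times \lambda_0$,
$\lambda_6 = \alpha_4 + \lambda_5$. (b) $\kappa_1 = 1 + \beta_1 x$, $\kappa_2 = 1 + \beta_2\kappa_1 x$, $\kappa_3 = 1 + \beta_3\kappa_2 x$, $u(x) = \beta_4\kappa_3 =
\beta_1\beta_2\beta_3\beta_4 x^3 + \beta_2\beta_3\beta_4 x^2 + \beta_3\beta_4 x + \beta_4$. (c) If any coefficient is zero, the coefficient of $x^3$
must also be zero in (b), while (a) yields an arbitrary polynomial $\alpha_1 x^3 + \alpha_2 x^2 + \alpha_3 x + \alpha_4$
of degree $\le 3$.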

27. Otherwise there would be a nonzero polynomial $f(q_n, \ldots, q_1, q_0)$ with integer coefficients
such that $q_n \cdot f(q_n, \ldots, q_1, q_0) = 0$ for all sets $(q_n, \ldots, q_0)$ of real numbers. This
cannot happen, since it is easy to prove by induction on n that a nonzero polynomial
always takes on some nonzero value. (Cf. exercise 4.6.1-16. However, this result is
false for finite fields in place of the real numbers.)


28. The indeterminate quantities $\alpha_1, \ldots, \alpha_s$ form an algebraic basis for the polynomial
domain $Q[\alpha_1, \ldots, \alpha_s]$, where Q is the field of rational numbers. Since s + 1 is greater
than the number of elements in a basis, the polynomials $f_0(\alpha_1, \ldots, \alpha_s)$, ..., $f_s(\alpha_1, \ldots, \alpha_s)$
are algebraically dependent; this means that there is a nonzero polynomial g with rational coefficients
such that $g(f_0(\alpha_1, \ldots, \alpha_s), \ldots, f_s(\alpha_1, \ldots, \alpha_s))$ is identically zero.

29. Given $j_0, \ldots, j_t \in \{0, 1, \ldots, n\}$, there are nonzero polynomials with integer
coefficients such that $g_j(q_{j_0}, \ldots, q_{j_t}) = 0$ for all $(q_n, \ldots, q_0)$ in $R_j$, $1 \le j \le m$. The
product $g_1 g_2 \ldots g_m$ is therefore zero for all $(q_n, \ldots, q_0)$ in $R_1 \cup \cdots \cup R_m$.
30. Starting with the construction in Theorem M, we will prove that $m_p + (1 - \delta_{0m_c})$ of
the $\beta$'s may effectively be eliminated: If $\mu_i$ corresponds to a parameter multiplication,
we have $\mu_i = \beta_{2i-1} \times (T_{2i} + \beta_{2i})$; add $c\beta_{2i-1}\beta_{2i}$ to each $\beta_j$ for which $c\mu_i$ occurs in $T_j$,
and replace $\beta_{2i}$ by zero. This removes one parameter for each parameter multiplication.
If $\mu_i$ is the first chain multiplication, then
$\mu_i = (\gamma_1 x + \theta_1 + \beta_{2i-1}) \times (\gamma_2 x + \theta_2 + \beta_{2i})$, where $\gamma_1$, $\gamma_2$, $\theta_1$, $\theta_2$ are polynomials in $\beta_1$,
..., $\beta_{2i-2}$ with integer coefficients. Here $\theta_1$ and $\theta_2$ can be "absorbed" into $\beta_{2i-1}$ and
$\beta_{2i}$, respectively, so we may assume that $\theta_1 = \theta_2 = 0$. Now add $c\beta_{2i-1}\beta_{2i}$ to each $\beta_j$
for which $c\mu_i$ occurs in $T_j$; add $\beta_{2i-1}\gamma_2/\gamma_1$ to $\beta_{2i}$; and set $\beta_{2i-1}$ to zero. The result set
is unchanged by this elimination of $\beta_{2i-1}$, except for the values of $\alpha_1$, ..., $\alpha_s$ such that
$\gamma_1$ is zero. [This proof is essentially due to V. Ia. Pan, Russian Mathematical Surveys
21,1 (1966), 105-136.] The latter case can be handled as in the proof of Theorem A,
since the polynomials with $\gamma_1 = 0$ can be evaluated by eliminating $\beta_{2i}$ (as in the first
construction, where $\mu_i$ corresponds to a parameter multiplication).

31. Otherwise we could add one parameter multiplication as a final step, and violate 
Theorem C. (The exercise is an improvement over Theorem A, in this special case, 
since there are only n degrees of freedom in the coefficients of a monic polynomial of 
degree n.) 


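32. $\lambda_1 = \lambda_0 \times \lambda_0$, $\lambda_2 = \alpha_1 \times \lambda_1$, $\lambda_3 = \alpha_2 + \lambda_2$, $\lambda_4 = \lambda_3 \times \lambda_1$, $\lambda_5 = \alpha_3 + \lambda_4$.
We need at least three multiplications to compute $u_4x^4$ (see Section 4.6.3), and at least
two additions by Theorem A.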

33. We must have $n + 1 \le 2m_c + m_p + \delta_{0m_c}$, and $m_c + m_p = (n+1)/2$; so there are no
parameter multiplications. Now the first $\lambda_i$ whose leading coefficient (as a polynomial
in x) is not an integer must be obtained by a chain addition; and there must be at least
n + 1 parameters, so there are at least n + 1 parameter additions.

34. Transform the given chain step by step, and also define the "content" $c_i$ of $\lambda_i$, as
follows: (Intuitively, $c_i$ is the leading coefficient of $\lambda_i$.) Define $c_0 = 1$. (a) If the
step has the form $\lambda_i = \alpha_j + \lambda_k$, replace it by $\lambda_i = \beta_j + \lambda_k$, where $\beta_j = \alpha_j/c_k$; and
define $c_i = c_k$. (b) If the step has the form $\lambda_i = \alpha_j - \lambda_k$, replace it by $\lambda_i = \beta_j + \lambda_k$,
where $\beta_j = -\alpha_j/c_k$; and define $c_i = -c_k$. (c) If the step has the form $\lambda_i = \alpha_j \times \lambda_k$,
replace it by $\lambda_i = \lambda_k$ (the step will be deleted later); and define $c_i = \alpha_j c_k$. (d) If the
step has the form $\lambda_i = \lambda_j \times \lambda_k$, leave it unchanged; and define $c_i = c_j c_k$.

After this process is finished, delete all steps of the form $\lambda_i = \lambda_k$, replacing $\lambda_i$
by $\lambda_k$ in each future step that uses $\lambda_i$. Then add a final step $\lambda_{r+1} = \beta \times \lambda_r$, where
$\beta = c_r$. This is the desired scheme, since it is easy to verify that the new $\lambda_i$ are just
the old ones divided by the factor $c_i$. The $\beta$'s are given functions of the $\alpha$'s; division by
zero is no problem, because if any $c_k = 0$ we must have $c_r = 0$ (hence the coefficient
of $x^n$ is zero), or else $\lambda_k$ never contributes to the final result.

35. Since there are at least five parameter steps, the result is trivial unless there is at
least one parameter multiplication; considering the ways in which three multiplications
can form $u_4x^4$, we see that there must be one parameter multiplication and two chain
multiplications. Therefore the four addition-subtractions must each be parameter steps,
and exercise 34 applies. We can now assume that only additions are used, and that
we have a chain to compute a general monic fourth-degree polynomial with two chain
multiplications and four parameter additions. The only possible scheme of this type
that calculates a fourth-degree polynomial has the form

    $\lambda_1 = \alpha_1 + \lambda_0$
    $\lambda_2 = \alpha_2 + \lambda_0$
    $\lambda_3 = \lambda_1 \times \lambda_2$
    $\lambda_4 = \alpha_3 + \lambda_3$
    $\lambda_5 = \alpha_4 + \lambda_3$
    $\lambda_6 = \lambda_4 \times \lambda_5$
    $\lambda_7 = \alpha_5 + \lambda_6$

Actually this chain has one addition too many, but any correct scheme can be put into
this form if we restrict some of the $\alpha$'s to be functions of the others. Now $\lambda_7$ has the
form $(x^2 + Ax + B)(x^2 + Ax + C) + D = x^4 + 2Ax^3 + (E + A^2)x^2 + EAx + F$, where
$A = \alpha_1 + \alpha_2$, $B = \alpha_1\alpha_2 + \alpha_3$, $C = \alpha_1\alpha_2 + \alpha_4$, $D = \alpha_5$, $E = B + C$, $F = BC + D$;
and since this involves only three independent parameters it cannot represent a general
monic fourth-degree polynomial.
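The loss of a degree of freedom can be made concrete. Writing the result as $x^4 + u_3x^3 + u_2x^2 + u_1x + u_0$, the stated form forces $u_3 = 2A$, $u_2 = E + A^2$, $u_1 = EA$, and hence $8u_1 = u_3(4u_2 - u_3^2)$. That relation is derived here for illustration and is not quoted from the text. The Python sketch below runs the seven-step chain and checks both the expansion and the relation; function names are hypothetical.

```python
def run_chain(x, a1, a2, a3, a4, a5):
    """Execute the seven-step chain literally."""
    l1 = a1 + x
    l2 = a2 + x
    l3 = l1 * l2
    l4 = a3 + l3
    l5 = a4 + l3
    l6 = l4 * l5
    return a5 + l6

def chain_coefficients(a1, a2, a3, a4, a5):
    """Coefficients (u3, u2, u1, u0) of the monic quartic, via A, ..., F."""
    A = a1 + a2
    B = a1 * a2 + a3
    C = a1 * a2 + a4
    D = a5
    E = B + C
    F = B * C + D
    return 2 * A, E + A * A, E * A, F

def satisfies_relation(u3, u2, u1):
    """Dependency among coefficients implied by the stated form."""
    return 8 * u1 == u3 * (4 * u2 - u3 ** 2)
```

A general monic quartic such as $x^4 + x$ violates the relation, so no choice of parameters in this chain produces it.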

36. As in the solution to exercise 35, we may assume that the chain computes a
general monic polynomial of degree six, using only three chain multiplications and six
parameter additions. The computation must take one of two general forms

    $\lambda_1 = \alpha_1 + \lambda_0$              $\lambda_1 = \alpha_1 + \lambda_0$
    $\lambda_2 = \alpha_2 + \lambda_0$              $\lambda_2 = \alpha_2 + \lambda_0$
    $\lambda_3 = \lambda_1 \times \lambda_2$        $\lambda_3 = \lambda_1 \times \lambda_2$
    $\lambda_4 = \alpha_3 + \lambda_0$              $\lambda_4 = \alpha_3 + \lambda_3$
    $\lambda_5 = \alpha_4 + \lambda_3$              $\lambda_5 = \alpha_4 + \lambda_3$
    $\lambda_6 = \lambda_4 \times \lambda_5$        $\lambda_6 = \lambda_4 \times \lambda_5$
    $\lambda_7 = \alpha_5 + \lambda_6$              $\lambda_7 = \alpha_5 + \lambda_3$
    $\lambda_8 = \alpha_6 + \lambda_6$              $\lambda_8 = \alpha_6 + \lambda_6$
    $\lambda_9 = \lambda_7 \times \lambda_8$        $\lambda_9 = \lambda_7 \times \lambda_8$
    $\lambda_{10} = \alpha_7 + \lambda_9$           $\lambda_{10} = \alpha_7 + \lambda_9$

where, as in exercise 35, an extra addition has been inserted to cover a more general
case. Neither of these schemes can calculate a general sixth-degree monic polynomial,
since the first case is a polynomial of the form

    $(x^3 + Ax^2 + Bx + C)(x^3 + Ax^2 + Bx + D) + E$,

and the second case (cf. exercise 35) is a polynomial of the form

    $(x^4 + 2Ax^3 + (E + A^2)x^2 + EAx + F)(x^2 + Ax + G) + H$;

both of these involve only five independent parameters.

37. Let $u^{[0]}(x) = u_nx^n + u_{n-1}x^{n-1} + \cdots + u_0$, $v^{[0]}(x) = x^n + v_{n-1}x^{n-1} + \cdots + v_0$.
For $1 \le j \le n$, divide $u^{[j-1]}(x)$ by the monic polynomial $v^{[j-1]}(x)$, obtaining
$u^{[j-1]}(x) = \alpha_j v^{[j-1]}(x) + \beta_j v^{[j]}(x)$. Assume that a monic polynomial $v^{[j]}(x)$ of degree
n - j exists satisfying this relation; this will be true for almost all rational functions.
Let $u^{[j]}(x) = v^{[j-1]}(x) - xv^{[j]}(x)$. These definitions imply that $\deg(u^{[n]}) < 1$, so we
may let $\alpha_{n+1} = u^{[n]}(x)$.

For the given rational function we have

    j    $\alpha_j$    $\beta_j$    $v^{[j]}(x)$    $u^{[j]}(x)$
    1    1             2            x + 5           3x + 19
    2    3             4            1               5

so $u^{[0]}(x)/v^{[0]}(x) = 1 + 2/(x + 3 + 4/(x + 5))$.
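The construction above translates into a short routine. The Python sketch below uses exact fractions and coefficient lists with the constant term first; the names are hypothetical. The test polynomials $u^{[0]}(x) = x^2 + 10x + 29$ and $v^{[0]}(x) = x^2 + 8x + 19$ are not quoted from the text. They are reconstructed from the tabulated values via $v^{[0]} = u^{[1]} + xv^{[1]}$ and $u^{[0]} = \alpha_1 v^{[0]} + \beta_1 v^{[1]}$.

```python
from fractions import Fraction

def continued_fraction(u, v):
    """Return (alphas, betas) for u(x)/v(x); v must be monic, deg u <= deg v = n.

    alphas has n + 1 entries (alpha_1 .. alpha_{n+1}), betas has n entries.
    """
    u = [Fraction(c) for c in u]
    v = [Fraction(c) for c in v]
    n = len(v) - 1
    alphas, betas = [], []
    for j in range(1, n + 1):
        d = n - j + 1                                  # degree of v^[j-1]
        alpha = u[d]                                   # v^[j-1] is monic
        r = [u[k] - alpha * v[k] for k in range(d)]    # u^[j-1] - alpha_j v^[j-1]
        beta = r[d - 1]
        if beta == 0:
            raise ValueError("no monic v^[j] of degree n - j exists here")
        v_next = [c / beta for c in r]                 # monic, degree d - 1
        # u^[j] = v^[j-1] - x v^[j]; the x^d terms cancel.
        u = [v[k] - (v_next[k - 1] if k >= 1 else 0) for k in range(d)]
        v = v_next
        alphas.append(alpha)
        betas.append(beta)
    alphas.append(u[0])                                # alpha_{n+1} = u^[n]
    return alphas, betas

def evaluate(alphas, betas, x):
    """alpha_1 + beta_1/(x + alpha_2 + beta_2/(x + ... + alpha_{n+1}))."""
    value = alphas[-1]
    for alpha, beta in zip(reversed(alphas[:-1]), reversed(betas)):
        value = alpha + beta / (x + value)
    return value
```

Each evaluation costs n divisions and 2n additions, which matches the operation counts discussed in the notes that follow.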

Notes: A general rational function of the stated form has 2n + 1 "degrees of
freedom," in the sense that it can be shown to have 2n + 1 essentially independent
parameters. If we generalize polynomial chains to "arithmetic chains," which allow
division operations as well as addition, subtraction, and multiplication, we can obtain
the following results with slight modifications to the proofs of Theorems A and M: An
arithmetic chain with q addition-subtraction steps has at most q + 1 degrees of freedom.
An arithmetic chain with m multiplication-division steps has at most 2m + 1 degrees
of freedom. Therefore an arithmetic chain that computes almost all rational functions
of the stated form must have at least 2n addition-subtractions, and n multiplication-divisions;
the method in this exercise is "optimal."

38. The theorem is certainly true if n = 0. Assume that n is positive, and that a
polynomial chain computing $P(x; u_0, \ldots, u_n)$ is given, where each of the parameters $\alpha_j$
has been replaced by a real number. Let $\lambda_i = \lambda_j \times \lambda_k$ be the first chain multiplication
step that involves one of $u_0$, ..., $u_n$; such a step must exist because of the rank of A.
Without loss of generality, we may assume that $\lambda_j$ involves $u_n$; thus, $\lambda_j$ has the form
$h_0u_0 + \cdots + h_nu_n + f(x)$, where $h_0$, ..., $h_n$ are real, $h_n \ne 0$, and f(x) is a polynomial
with real coefficients. (The h's and the coefficients of f(x) are derived from the values
assigned to the $\alpha$'s.)

Now change step i to $\lambda_i = \alpha \times \lambda_k$, where $\alpha$ is an arbitrary real number. (We
could take $\alpha = 0$; general $\alpha$ is used here merely to show that there is a certain amount
of flexibility available in the proof.) Add further steps to calculate

    $\lambda = (\alpha - f(x) - h_0u_0 - \cdots - h_{n-1}u_{n-1})/h_n$;

these new steps involve only additions and parameter multiplications (by suitable new
parameters). Finally, replace $\lambda_{-n-1} = u_n$ everywhere in the chain by this new
element $\lambda$. The result is a chain that calculates

    $Q(x; u_0, \ldots, u_{n-1}) = P(x; u_0, \ldots, u_{n-1}, (\alpha - f(x) - h_0u_0 - \cdots - h_{n-1}u_{n-1})/h_n)$;

and this chain has one less chain multiplication. The proof will be complete if we
can show that Q satisfies the hypotheses. The quantity $(\alpha - f(x))/h_n$ leads to a
possibly increased value of m, and a new vector B'. If the columns of A are $A_0$, $A_1$,
..., $A_n$ (these vectors being linearly independent over the reals), the new matrix A'
corresponding to Q has the column vectors

    $A_0 - (h_0/h_n)A_n$, ..., $A_{n-1} - (h_{n-1}/h_n)A_n$,

plus perhaps a few rows of zeros to account for an increased value of m, and these
columns are clearly also linearly independent. By induction, the chain that computes Q
has at least n - 1 chain multiplications, so the original chain has at least n.

[Pan showed also that the use of division would give no improvement; cf. Problemy 
Kibernetiki 7 (1962), 21-30. Generalizations to the computation of several polynomials 
in several variables, with and without various kinds of preconditioning, have been given 
by S. Winograd, Comm. Pure and Applied Math. 23 (1970), 165-179.] 
39. By induction on m. Let $w_m(x) = x^{2m} + u_{2m-1}x^{2m-1} + \cdots + u_0$, $w_{m-1}(x) =
x^{2m-2} + v_{2m-3}x^{2m-3} + \cdots + v_0$, $a = \alpha_1 + \gamma_m$, $b = \alpha_m$, and let

    $f(r) = \sum_{i,j \ge 0} (-1)^{i+j} \binom{i+j}{j} u_{r+i+2j}\, a^i b^j$.

It follows that $v_r = f(r + 2)$ for $r \ge 0$, and $\delta_m = f(1)$. If $\delta_m = 0$ and a is given, we
have a polynomial of degree m - 1 in b, with leading coefficient $\pm(u_{2m-1} - ma) =
\pm(\gamma_2 + \cdots + \gamma_m - m\gamma_m)$.

In Motzkin's unpublished notes he arranged to make $\delta_k = 0$ almost always, by
choosing $\gamma$'s so that this leading coefficient is $\ne 0$ when m is even and = 0 when m is
odd; then we almost always can let b be a (real) root of an odd-degree polynomial.

40. No; S. Winograd found a way to compute all polynomials of degree 13 with only 7
(possibly complex) multiplications [Comm. Pure and Applied Math. 25 (1972), 455-457].
L. Revah found schemes that evaluate almost all polynomials of degree $n \ge 9$ with
$\lfloor n/2 \rfloor + 1$ (possibly complex) multiplications [SIAM J. Computing 4 (1975), 381-392];
she also showed that when n = 9 it is possible to achieve $\lfloor n/2 \rfloor + 1$ multiplications only
with at least n + 3 additions. By appending sufficiently many additions (cf. exercise 39),
the "almost all" and "possibly complex" provisos disappear. V. Ia. Pan [Proc. ACM
Symp. Theory Comp. 10 (1978), 162-172; IBM Research Report RC7754 (1979)] found
schemes with $\lfloor n/2 \rfloor + 1$ (complex) multiplications and the minimum number $n + 2 + \delta_{n9}$
of (complex) additions, for all odd $n \ge 9$; his method for n = 9 is

    $v(x) = ((x + \alpha)^2 + \beta)(x + \gamma)$,   $w(x) = v(x) + x$,
    $t(x) = (v(x) + \delta)(w(x) + \epsilon) - (v(x) + \delta')(w(x) + \epsilon')$,
    $u(x) = (v(x) + \zeta)(t(x) + \eta) + \kappa$.

The minimum number of real additions necessary, when the minimum number of (real)
multiplications is achieved, remains unknown for $n \ge 9$.

41. $a(c + d) - (a + b)d + i(a(c + d) + (b - a)c)$. [Beware numerical instability. Three
multiplications are necessary, since complex multiplication is a special case of (69) with
$p(u) = u^2 + 1$. Without the restriction on additions there are other possibilities. For
example, the symmetric formula $ac - bd + i((a + b)(c + d) - ac - bd)$ was suggested
by Peter Ungar in 1963; cf. Eq. 4.3.3-2 with $2^n$ replaced by i. See I. Munro, Proc.
ACM Symp. Theory Comp. 3 (1971), 40-44; S. Winograd, Linear Alg. Appl. 4 (1971),
381-388.]

Alternatively, if $a^2 + b^2 = 1$ and $t = (1 - a)/b = b/(1 + a)$, the algorithm "$w =
c - td$, $v = d + bw$, $u = w - tv$" for calculating the product $(a + bi)(c + di) = u + iv$
has been suggested by Oscar Buneman [J. Comp. Phys. 12 (1973), 127-128]. In this
method if $a = \cos\theta$ and $b = \sin\theta$, we have $t = \tan(\theta/2)$.

[Helmut Alt and Jan van Leeuwen have shown that four real multiplications or
divisions are necessary for computing $1/(a + bi)$, and four are sufficient for computing
$a/(b + ci)$. It is unknown whether $(a + bi)/(c + di)$ can be computed with only five
multiplications or divisions.]
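The three formulas in this answer are short enough to state as code. The Python sketch below is illustrative and the function names are hypothetical. In `rotate_buneman` the quantities $b = \sin\theta$ and $t = \tan(\theta/2)$ are computed inline for clarity, although in practice they would be tabulated in advance.

```python
import math

def mult_three(a, b, c, d):
    """(a + bi)(c + di) with three real multiplications; returns (real, imag)."""
    p = a * (c + d)
    return p - (a + b) * d, p + (b - a) * c

def mult_ungar(a, b, c, d):
    """Symmetric three-multiplication formula attributed to Ungar in the text."""
    ac, bd = a * c, b * d
    return ac - bd, (a + b) * (c + d) - ac - bd

def rotate_buneman(theta, c, d):
    """Multiply c + di by cos(theta) + i sin(theta) using Buneman's three steps."""
    b = math.sin(theta)
    t = math.tan(theta / 2)
    w = c - t * d
    v = d + b * w
    u = w - t * v
    return u, v
```

For example, (1 + 2i)(3 + 4i) = -5 + 10i under both of the first two formulas.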

42. (a) Let $\pi_1$, ..., $\pi_m$ be the $\lambda_i$'s that correspond to chain multiplications; then
$\pi_i = P_{2i-1} \times P_{2i}$ and $u(x) = P_{2m+1}$, where each $P_j$ has the form
$\beta_j + \beta_{j0}x + \beta_{j1}\pi_1 + \cdots + \beta_{jr(j)}\pi_{r(j)}$, where $r(j) \le \lceil j/2 \rceil - 1$ and each of the $\beta_j$ and $\beta_{jk}$ is a polynomial in the
$\alpha$'s with integer coefficients. We can systematically modify the chain (cf. exercise 30)
so that $\beta_j = 0$ and $\beta_{jr(j)} = 1$, for $1 \le j \le 2m$; furthermore we can assume that
$\beta_{30} = 0$. The result set now has at most $m + 1 + \sum_{1 \le j \le 2m} (\lceil j/2 \rceil - 1) = m^2 + 1$
degrees of freedom.

(b) Any such polynomial chain with at most m chain multiplications can be
simulated by one with the form considered in (a), except that now we let $r(j) =
\lceil j/2 \rceil - 1$ for $1 \le j \le 2m + 1$, and we do not assume that $\beta_{30} = 0$ or that $\beta_{jr(j)} = 1$
for $j \ge 3$. This single canonical form involves $m^2 + 2m$ parameters. As the $\alpha$'s run
through all integers and as we run through all chains, the $\beta$'s run through at most
$2^{m^2+2m}$ sets of values mod 2, hence the result set does also. In order to obtain all $2^n$
polynomials of degree n with 0-1 coefficients, we need $m^2 + 2m \ge n$.

(c) Set $m \leftarrow \lfloor\sqrt{n}\rfloor$ and compute $x^2$, $x^3$, ..., $x^m$. Let $u(x) = u_{m+1}(x)x^{(m+1)m} +
\cdots + u_1(x)x^m + u_0(x)$, where each $u_j(x)$ is a polynomial of degree $\le m$ with integer
coefficients (hence it can be evaluated without any more multiplications). Now evaluate
u(x) by rule (2) as a polynomial in $x^m$ with known coefficients. (The number of additions
used is approximately the sum of the absolute values of the coefficients, so this algorithm
is efficient on 0-1 polynomials. Paterson and Stockmeyer also gave another algorithm
that uses about $\sqrt{2n}$ multiplications.)

[Reference: SIAM J. Computing 2 (1973), 60-66; see also J. E. Savage, SIAM J.
Computing 3 (1974), 150-158. For analogous results about additions, see Borodin and
Cook, SIAM J. Computing 5 (1976), 146-157; Rivest and Van de Wiele, Inf. Proc.
Letters 8 (1979), 178-180.]

43. When $a_i = a_j + a_k$ is a step in some optimal addition chain for n + 1, compute
$x^i = x^j x^k$ and $p_i = p_k x^j + p_j$, where $p_i = x^{i-1} + \cdots + x + 1$; omit the final calculation
of $x^{n+1}$. We save one multiplication whenever $a_k = 1$, in particular when i = 1. (Cf.
exercise 4.6.3-31 with e = £.) 

44. It suffices to show that $(T_{ijk})$'s rank is at most that of $(t_{ijk})$, since we can obtain
$(t_{ijk})$ back from $(T_{ijk})$ by transforming it in the same way with $F^{-1}$, $G^{-1}$, $H^{-1}$. If
$t_{ijk} = \sum_{1 \le l \le r} a_{il}b_{jl}c_{kl}$ then it follows immediately that

    $T_{ijk} = \sum_{1 \le l \le r} \bigl(\sum_{1 \le p \le m} F_{ip}a_{pl}\bigr)\bigl(\sum_{1 \le q \le n} G_{jq}b_{ql}\bigr)\bigl(\sum_{1 \le r \le s} H_{kr}c_{rl}\bigr)$.

[H. F. de Groote has proved that all normal schemes that yield $2 \times 2$ matrix
products with seven chain multiplications are equivalent, in the sense that they can be
obtained from each other by nonsingular matrix multiplication as in this exercise. In
this sense Strassen's algorithm is unique.]

45. By exercise 44 we can add any multiple of a row, column, or plane to another one 
without changing the rank; we can also multiply a row, column, or plane by a nonzero 
constant, or transpose the tensor. A sequence of such operations can always be found to 
reduce a given $2 \times 2 \times 2$ tensor to one of the forms (J J)(J J), (J S)(o o)> (o i)(o o)» (o o)(o i)>
(-)(-). The last tensor has rank 3 or 2 according as the polynomial $u^2 - ru - q$ has
one or two irreducible factors in the field of interest, by Theorem W (cf. (72)).

46. A general $m \times n \times s$ tensor has mns degrees of freedom. By exercise 28 it is
impossible to express all $m \times n \times s$ tensors in terms of the $(m + n + s)r$ elements of
a realization A, B, C unless $(m + n + s)r \ge mns$. On the other hand, assume that
$m \ge n \ge s$. The rank of an $m \times n$ matrix is at most n, so we can realize any tensor
in ns chain multiplications by realizing each matrix plane separately. [Exercise 45
shows that this lower bound on the maximum tensor rank is not best possible, nor is
the upper bound. Thomas D. Howell (Ph. D. thesis, Cornell Univ., 1976) has shown
that there are tensors of rank $\ge \lceil mns/(m + n + s - 2) \rceil$ over the complex numbers.]
48. If A, B, C and A', B', C' are realizations of $(t_{ijk})$ and $(t'_{ijk})$ of respective lengths
r and r', then $A'' = A \oplus A'$, $B'' = B \oplus B'$, $C'' = C \oplus C'$, and $A''' = A \otimes A'$,
$B''' = B \otimes B'$, $C''' = C \otimes C'$, are realizations of $(t''_{ijk})$ and $(t'''_{ijk})$ of respective lengths
r + r' and r · r'.

Note: Many people have made the natural conjecture that $\mathrm{rank}((t_{ijk}) \oplus (t'_{ijk})) =
\mathrm{rank}(t_{ijk}) + \mathrm{rank}(t'_{ijk})$, but the construction in exercise 60(b) makes this seem much
less plausible than it once was.

49. By Lemma T, $\mathrm{rank}(t_{ijk}) \ge \mathrm{rank}(t_{i(jk)})$. Conversely if M is a matrix of rank r
we can transform it by row and column operations, finding nonsingular matrices F
and G such that FMG has all entries 0 except for r diagonal elements that are 1; cf.
Algorithm 4.6.2N. The tensor rank of FMG is therefore $\le r$; and it is the same as
the tensor rank of M, by exercise 44.

50. Let $i = \langle i', i'' \rangle$ where $1 \le i' \le m$ and $1 \le i'' \le n$; then $t_{\langle i',i'' \rangle jk} = \delta_{i''j}\delta_{i'k}$, and
it is clear that $\mathrm{rank}(t_{i(jk)}) = mn$ since $(t_{i(jk)})$ is a permutation matrix. By Lemma T,
$\mathrm{rank}(t_{ijk}) \ge mn$. Conversely, since $(t_{ijk})$ has only mn nonzero entries, its rank is
clearly $\le mn$. (There is consequently no normal scheme requiring fewer than the mn
obvious multiplications. There is no such abnormal scheme either [Comm. Pure and
Appl. Math. 23 (1970), 165-179]. But some savings can be achieved if the same matrix
is used with s > 1 different column vectors, since this is equivalent to $(m \times n)$ times
$(n \times s)$ matrix multiplication.)

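51. (a) $s_1 = y_0 + y_1$, $s_2 = y_0 - y_1$; $m_1 = \frac12(x_0 + x_1)s_1$, $m_2 = \frac12(x_0 - x_1)s_2$; $w_0 =
m_1 + m_2$, $w_1 = m_1 - m_2$. (b) Here are some intermediate steps, using the methodology
in the text: $((x_0 - x_2) + (x_1 - x_2)u)((y_0 - y_2) + (y_1 - y_2)u) \bmod (u^2 + u + 1) =
((x_0 - x_2)(y_0 - y_2) - (x_1 - x_2)(y_1 - y_2)) + ((x_0 - x_2)(y_0 - y_2) - (x_1 - x_0)(y_1 - y_0))u$.
The first realization is

$$\begin{pmatrix} 1 & 1 & -1 & 0 \\ 1 & 0 & 1 & 1 \\ 1 & -1 & 0 & -1 \end{pmatrix},\quad
\begin{pmatrix} 1 & 1 & -1 & 0 \\ 1 & 0 & 1 & 1 \\ 1 & -1 & 0 & -1 \end{pmatrix},\quad
\begin{pmatrix} 1 & 1 & 1 & -2 \\ 1 & 1 & -2 & 1 \\ 1 & -2 & 1 & 1 \end{pmatrix} \times \tfrac13.$$

The second realization is

$$\begin{pmatrix} 1 & 1 & 1 & -2 \\ 1 & 1 & -2 & 1 \\ 1 & -2 & 1 & 1 \end{pmatrix} \times \tfrac13,\quad
\begin{pmatrix} 1 & 1 & -1 & 0 \\ 1 & -1 & 0 & -1 \\ 1 & 0 & 1 & 1 \end{pmatrix},\quad
\begin{pmatrix} 1 & 1 & -1 & 0 \\ 1 & 0 & 1 & 1 \\ 1 & -1 & 0 & -1 \end{pmatrix}.$$

The resulting algorithm computes $s_1 = y_0 + y_1$, $s_2 = y_0 - y_1$, $s_3 = y_2 - y_0$, $s_4 =
y_2 - y_1$, $s_5 = s_1 + y_2$; $m_1 = \frac13(x_0 + x_1 + x_2)s_5$, $m_2 = \frac13(x_0 + x_1 - 2x_2)s_2$, $m_3 =
\frac13(x_0 - 2x_1 + x_2)s_3$, $m_4 = \frac13(-2x_0 + x_1 + x_2)s_4$; $t_1 = m_1 + m_2$, $t_2 = m_1 - m_2$,
$t_3 = m_1 + m_3$, $w_0 = t_1 - m_3$, $w_1 = t_3 + m_4$, $w_2 = t_2 - m_4$.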
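Both algorithms can be checked against the definition $w_k = \sum_{i+j \equiv k} x_i y_j$. The Python sketch below is illustrative and its function names are hypothetical; it uses exact fractions so that the factors 1/2 and 1/3 introduce no rounding. The combinations of the x's would normally be precomputed when the x's are fixed parameters.

```python
from fractions import Fraction

def cyclic2(x, y):
    """Degree-2 cyclic convolution with two chain multiplications (part (a))."""
    s1 = y[0] + y[1]
    s2 = y[0] - y[1]
    m1 = Fraction(1, 2) * (x[0] + x[1]) * s1
    m2 = Fraction(1, 2) * (x[0] - x[1]) * s2
    return [m1 + m2, m1 - m2]

def cyclic3(x, y):
    """Degree-3 cyclic convolution with four chain multiplications (part (b))."""
    third = Fraction(1, 3)
    s1 = y[0] + y[1]
    s2 = y[0] - y[1]
    s3 = y[2] - y[0]
    s4 = y[2] - y[1]
    s5 = s1 + y[2]
    m1 = third * (x[0] + x[1] + x[2]) * s5
    m2 = third * (x[0] + x[1] - 2 * x[2]) * s2
    m3 = third * (x[0] - 2 * x[1] + x[2]) * s3
    m4 = third * (-2 * x[0] + x[1] + x[2]) * s4
    t1 = m1 + m2
    t2 = m1 - m2
    t3 = m1 + m3
    return [t1 - m3, t3 + m4, t2 - m4]

def cyclic_direct(x, y):
    """Reference: w_k = sum of x_i * y_j over i + j = k (modulo n)."""
    n = len(x)
    return [sum(x[i] * y[(k - i) % n] for i in range(n)) for k in range(n)]
```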

52. Let $i = \langle i', i'' \rangle$ when $i \bmod n' = i'$ and $i \bmod n'' = i''$. Then we wish to compute
$w_{\langle k',k'' \rangle} = \sum x_{\langle i',i'' \rangle} y_{\langle j',j'' \rangle}$ summed for $i' + j' \equiv k'$ (modulo n') and $i'' + j'' \equiv k''$
(modulo n''). This can be done by applying the n' algorithm to the 2n' vectors $X_{i'}$
and $Y_{j'}$ of length n'', obtaining the n' vectors $W_{k'}$. Each vector addition becomes n''
additions, each parameter multiplication becomes n'' parameter multiplications, and
each chain multiplication of vectors is replaced by a cyclic convolution of degree n''.
[If the subalgorithms use the minimum number of chain multiplications, this algorithm
uses $2(n' - d(n'))(n'' - d(n''))$ more than the minimum, where d(n) is the number of
divisors of n.]
53. (a) Let $n(k) = (p - 1)p^{e-k-1} = \varphi(p^{e-k})$ for $0 \le k < e$, and n(k) = 1 for $k \ge e$.
Represent the numbers {1, ..., m} in the form $a^i p^k$ (modulo m), where $0 \le k \le e$ and
$0 \le i < n(k)$, and a is a fixed primitive element modulo $p^e$. For example, when m = 9
we can let a = 2; the values are $\{2^03^0, 2^13^0, 2^03^1, 2^23^0, 2^53^0, 2^13^1, 2^43^0, 2^33^0, 2^03^2\}$.
Then $f(a^i p^k) = \sum_{0 \le l \le e} \sum_{0 \le j < n(l)} \omega^{g(i,j,k,l)} F(a^j p^l)$ where $g(i,j,k,l) = a^{i+j}p^{k+l}$.

We shall compute $f_{ikl} = \sum_{0 \le j < n(l)} \omega^{g(i,j,k,l)} F(a^j p^l)$ for $0 \le i < n(k)$ and for
each k, l. This is a cyclic convolution of degree n(k + l) on the values $x_i = \omega^{a^i p^{k+l}}$
and $y_s = \sum_{0 \le j < n(l),\ s+j \equiv 0\ (\mathrm{modulo}\ n(k+l))} F(a^j p^l)$, since $f_{ikl} = \sum x_r y_s$ summed over
$r + s \equiv i$ (modulo n(k + l)). The Fourier transform is obtained by summing appropriate
$f_{ikl}$'s. [Note: When linear combinations of the $x_i$ are formed, e.g., as in (67), the result
will be purely real or purely imaginary, when the cyclic convolution algorithm has been
constructed by using rule (57) with $u^{n(k)} - 1 = (u^{n(k)/2} - 1)(u^{n(k)/2} + 1)$. The reason is
that reduction mod $(u^{n(k)/2} - 1)$ produces a polynomial with real coefficients $\omega^j + \omega^{-j}$
while reduction mod $(u^{n(k)/2} + 1)$ produces a polynomial with imaginary coefficients
$\omega^j - \omega^{-j}$.]

When p = 2 an analogous construction applies, using the representation $(-1)^i a^j 2^k$
(modulo m), where $0 \le k \le e$ and $0 \le i \le \min(e - k, 1)$ and $0 \le j < 2^{e-k-2}$. In this
case we use the construction of exercise 52 with n' = 2 and $n'' = 2^{e-k-2}$; although
these numbers are not relatively prime, the construction does yield the desired direct
product of cyclic convolutions.

(b) Let $a'm' + a''m'' = 1$; and let $\omega' = \omega^{a''m''}$, $\omega'' = \omega^{a'm'}$. Define $s' = s \bmod m'$,
$s'' = s \bmod m''$, $t' = t \bmod m'$, $t'' = t \bmod m''$, so that $\omega^{st} = (\omega')^{s't'}(\omega'')^{s''t''}$.
It follows that $f(s', s'') = \sum_{0 \le t' < m',\ 0 \le t'' < m''} (\omega')^{s't'}(\omega'')^{s''t''} F(t', t'')$; in other words,
the one-dimensional Fourier transform on m elements is actually a two-dimensional
Fourier transform on $m' \times m''$ elements, in slight disguise.

We shall deal with "normal" algorithms consisting of (i) a number of sums $s_i$ of
the F's and s's; followed by (ii) a number of products $m_j$, each of which is obtained
by multiplying one of the F's or s's by a real or imaginary number $\alpha_j$; followed by
(iii) a number of further sums $t_k$, each of which is formed from m's or t's (not F's or
s's). The final values must be m's or t's. For example, the "normal" Fourier transform
scheme for m = 5 constructed from (67) and the method of part (a) is as follows: $s_1 =
F(1) + F(4)$, $s_2 = F(3) + F(2)$, $s_3 = s_1 + s_2$, $s_4 = s_1 - s_2$, $s_5 = F(1) - F(4)$, $s_6 =
F(2) - F(3)$, $s_7 = s_5 - s_6$; $m_1 = \frac14(\omega + \omega^2 + \omega^4 + \omega^3)s_3$, $m_2 = \frac14(\omega - \omega^2 + \omega^4 - \omega^3)s_4$,
$m_3 = \frac12(\omega + \omega^2 - \omega^4 - \omega^3)s_5$, $m_4 = \frac12(-\omega + \omega^2 + \omega^4 - \omega^3)s_6$, $m_5 = \frac12(\omega^3 - \omega^2)s_7$,
$m_6 = 1 \cdot F(5)$, $m_7 = 1 \cdot s_3$; $t_0 = m_1 + m_6$, $t_1 = t_0 + m_2$, $t_2 = m_3 + m_5$, $t_3 =
t_0 - m_2$, $t_4 = m_4 - m_5$, $t_5 = t_1 + t_2$, $t_6 = t_3 + t_4$, $t_7 = t_1 - t_2$, $t_8 = t_3 - t_4$, $t_9 =
m_6 + m_7$. Note the multiplication by 1 shown in $m_6$ and $m_7$; this is required by our
conventions, and it is important to include such cases for use in recursive constructions
(although the multiplications need not really be done). Here $m_6 = f_{001}$, $m_7 = f_{010}$,
$t_5 = f_{000} + f_{001} = f(2^0)$, $t_6 = f_{100} + f_{101} = f(2^1)$, etc. We can improve the scheme
by introducing $s_8 = s_3 + F(5)$, replacing $m_1$ by $(\frac14(\omega + \omega^2 + \omega^4 + \omega^3) - 1)s_3$ [this
is $-\frac54 s_3$], replacing $m_6$ by $1 \cdot s_8$, and deleting $m_7$ and $t_9$; this saves one of the trivial
multiplications by 1, and it will be advantageous when the scheme is used to build
larger ones. In the improved scheme, $f(5) = m_6$, $f(1) = t_5$, $f(2) = t_6$, $f(3) = t_8$,
$f(4) = t_7$.

Now suppose we have normal one-dimensional schemes for m' and m'', using
respectively (a', a'') complex additions, (t', t'') trivial multiplications by $\pm 1$ or $\pm i$, and
a total of (c', c'') complex multiplications including the trivial ones. (The nontrivial
complex multiplications are all "simple" since they involve only two real multiplications
and no real additions.) We can construct a normal scheme for the two-dimensional
$m' \times m''$ case by applying the m' scheme to vectors F(t', *) of length m''. Each $s_i$
step becomes m'' additions; each $m_j$ becomes a Fourier transform on m'' elements,
but with all of the $\alpha$'s in this algorithm multiplied by $\alpha_j$; and each $t_k$ becomes m''
additions. Thus the new algorithm has (a'm'' + c'a'') complex additions, t't'' trivial
multiplications, and a total of c'c'' complex multiplications.

Using these techniques, Winograd has found normal one-dimensional schemes for 
the following small values of m with the following costs (a, t, c): 


    m = 2   (2, 2, 2)        m = 7   (36, 1, 9)
    m = 3   (6, 1, 3)        m = 8   (26, 6, 8)
    m = 4   (8, 4, 4)        m = 9   (46, 1, 12)
    m = 5   (17, 1, 6)       m = 16  (74, 8, 18)


By combining these schemes as described above, we obtain methods that use fewer 
arithmetic operations than the “fast Fourier transform” (FFT) discussed in exercise 14. 
For example, when m = 1008 = 7 · 9 · 16, the costs come to (17946, 8, 1944), so we can do
a Fourier transform on 1008 complex numbers with 3872 real multiplications and 35892 
real additions. It is possible to improve on Winograd’s method for combining relatively 
prime moduli by using multidimensional convolutions, as shown by Nussbaumer and 
Quandalle in IBM J. Res. and Devel. 22 (1978), 134-144; their ingenious approach 
reduces the amount of computation needed for 1008-point complex Fourier transforms 
to 3084 real multiplications and 34668 real additions. By contrast, the FFT on 1024 
complex numbers involves 14344 real multiplications and 27652 real additions. If the 
two-passes-at-once improvement in the answer to exercise 14 is used, however, the FFT 
on 1024 complex numbers needs only 10936 real multiplications and 25948 additions, 
and it is not difficult to implement. Therefore the subtler methods are faster only on 
machines that take significantly longer to multiply than to add. 
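
The cost bookkeeping described above can be checked mechanically. The following Python sketch applies the combination rule $(a'm'' + c'a'',\ t't'',\ c'c'')$ to the tabulated triples. The names `COSTS`, `combine`, and `scheme` are illustrative only, and the nesting order 16, 7, 9 is an assumption, chosen because it reproduces the figures quoted for m = 1008.

```python
from functools import reduce

# (a, t, c) costs quoted in the table above, keyed by m.
COSTS = {2: (2, 2, 2), 3: (6, 1, 3), 4: (8, 4, 4), 5: (17, 1, 6),
         7: (36, 1, 9), 8: (26, 6, 8), 9: (46, 1, 12), 16: (74, 8, 18)}

def combine(outer, inner):
    """Cost of the m' x m'' scheme; each argument is a tuple (m, a, t, c)."""
    m1, a1, t1, c1 = outer
    m2, a2, t2, c2 = inner
    return (m1 * m2, a1 * m2 + c1 * a2, t1 * t2, c1 * c2)

def scheme(*factors):
    """Nest the small schemes; the first factor listed is the outermost one."""
    parts = [(m, *COSTS[m]) for m in factors]
    # Fold from the right so that the innermost pair is combined first.
    return reduce(lambda inner, outer: combine(outer, inner), reversed(parts))

m, a, t, c = scheme(16, 7, 9)
real_mults = 2 * (c - t)   # each nontrivial complex multiplication costs 2 real ones
real_adds = 2 * a          # each complex addition costs 2 real additions
```

Because the addition count $a'm'' + c'a''$ is not symmetric, other nesting orders give the same multiplication count but a different number of additions.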

[References: Proc. Nat. Acad. Sci. USA 73 (1976), 1005-1006; Math. Comp. 32 
(1978), 175-199; Advances in Math. 32 (1979), 83-117.] 

54. $\max(2e_1\deg(p_1) - 1, \ldots, 2e_q\deg(p_q) - 1, q + 1)$.

55. $2n' - r'$, where $n'$ is the degree of the minimum polynomial of $P$ (i.e., the monic
polynomial $p$ of least degree such that $p(P)$ is the zero matrix) and $r'$ is the number
of distinct irreducible factors it has. (Reduce $P$ by similarity transformations.)

56. Let $t_{ijk} + t_{jik} = \tau_{ijk} + \tau_{jik}$, for all $i$, $j$, $k$. If $A$, $B$, $C$ is a realization of
$(t_{ijk})$ of rank $r$, then
$\sum_{1\le l\le r} c_{kl}\bigl(\sum_i a_{il}x_i\bigr)\bigl(\sum_j b_{jl}x_j\bigr) = \sum_{i,j} t_{ijk}x_ix_j = \sum_{i,j} \tau_{ijk}x_ix_j$
for all $k$.

Conversely, let the $l$th chain multiplication of a polynomial chain, for $1 \le l \le r$, be the
product $(\alpha_l + \sum_i a_{il}x_i)(\beta_l + \sum_j b_{jl}x_j)$, where $\alpha_l$ and $\beta_l$ denote possible constant terms
and/or nonlinear terms. All terms of degree 2 appearing at any step of the chain can
be expressed as a linear combination $\sum_{1\le l\le r} c_l(\sum_i a_{il}x_i)(\sum_j b_{jl}x_j)$; hence the chain
defines a tensor $(t_{ijk})$ of rank $\le r$ such that $t_{ijk} + t_{jik} = \tau_{ijk} + \tau_{jik}$. This establishes the
hint. Now $\mathrm{rank}(\tau_{ijk} + \tau_{jik}) = \mathrm{rank}(t_{ijk} + t_{jik}) \le \mathrm{rank}(t_{ijk}) + \mathrm{rank}(t_{jik}) = 2\,\mathrm{rank}(t_{ijk})$.

A bilinear form in $x_1$, ..., $x_m$, $y_1$, ..., $y_n$ is a quadratic form in $m + n$ variables,
where $\tau_{ijk} = t_{i,j-m,k}$ for $i \le m$ and $j > m$, otherwise $\tau_{ijk} = 0$. Now
$\mathrm{rank}(\tau_{ijk}) + \mathrm{rank}(\tau_{jik}) \ge \mathrm{rank}(t_{ijk})$, since we obtain a realization of $(t_{ijk})$ by suppressing the last
$n$ rows of $A$ and the first $m$ rows of $B$ in a realization $A$, $B$, $C$ of $(\tau_{ijk} + \tau_{jik})$.

57. Let $N$ be the smallest power of 2 that exceeds $2n$, and let
$u_{n+1} = \cdots = u_{N-1} = v_{n+1} = \cdots = v_{N-1} = 0$. If
$U_s = \sum_{0\le t<N} \omega^{st}u_t$, $V_s = \sum_{0\le t<N} \omega^{st}v_t$, $0 \le s < N$, $\omega = e^{2\pi i/N}$,
then $\sum_{0\le s<N} \omega^{-st}U_sV_s = N\sum u_{t_1}v_{t_2}$, where the latter
sum is taken over all $t_1$ and $t_2$ with $0 \le t_1, t_2 < N$, $t_1 + t_2 \equiv t$ (modulo $N$). The
terms vanish unless $t_1 \le n$ and $t_2 \le n$, so $t_1 + t_2 < N$; thus the sum is the coefficient
of $z^t$ in the product $u(z)v(z)$. If we use the method of exercise 14 to compute the
Fourier transforms and the inverse transforms, the number of complex operations is
$O(N\log N) + O(N\log N) + O(N) + O(N\log N)$; and $N \le 4n$. [Cf. Section 4.3.3 and
the paper by J. M. Pollard, Math. Comp. 25 (1971), 365-374.]
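
The transform-multiply-invert recipe of answer 57 can be sketched in Python as follows. This is only an illustration under simplifying assumptions: a plain recursive radix-2 transform stands in for the method of exercise 14, floating-point rounding is ignored beyond a final `round`, and $N$ is taken as the smallest power of 2 that holds the product rather than the bound used in the text. The names `fft` and `multiply` are hypothetical.

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 transform of a list whose length is a power of 2."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = -1 if invert else 1
    out = [0] * n
    for s in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * s / n)   # omega^s, or omega^(-s)
        out[s] = even[s] + w * odd[s]
        out[s + n // 2] = even[s] - w * odd[s]
    return out

def multiply(u, v):
    """Coefficients of u(z)v(z) for lists of integer coefficients u and v."""
    size = len(u) + len(v) - 1
    N = 1
    while N < size:
        N *= 2
    U = fft(list(u) + [0] * (N - len(u)))
    V = fft(list(v) + [0] * (N - len(v)))
    W = fft([x * y for x, y in zip(U, V)], invert=True)
    # The inverse sum yields N times each coefficient, so divide by N.
    return [round((w / N).real) for w in W[:size]]
```

For example, `multiply([1, 2, 3], [4, 5])` multiplies $1 + 2z + 3z^2$ by $4 + 5z$.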

When multiplying integer polynomials, it is possible to use an integer number $\omega$
that is of order $2^t$ modulo a prime $p$, and to determine the results modulo sufficiently
many primes. Useful primes in this regard, together with their least primitive roots $r$
(from which we take $\omega = r^{(p-1)/2^t} \bmod p$ when $p \bmod 2^t = 1$), can be found as
described in Section 4.5.4. For $t = 9$, the ten largest cases $< 2^{35}$ are $p = 2^{35} - 512a + 1$,
where $(a, r)$ = (28, 7), (31, 10), (34, 13), (56, 3), (58, 10), (76, 5), (80, 3), (85, 11), (91, 5),
(101, 3); the ten largest cases $< 2^{31}$ are $p = 2^{31} - 512a + 1$, where $(a, r)$ = (1, 10),
(11, 3), (19, 11), (20, 3), (29, 3), (35, 3), (55, 19), (65, 6), (95, 3), (121, 10). For larger $t$, all
primes $p$ of the form $2^tq + 1$ where $q < 32$ is odd and $2^{24} < p < 2^{36}$ are given by
$(p - 1, r)$ = $(11 \cdot 2^{21}, 3)$, $(25 \cdot 2^{20}, 3)$, $(27 \cdot 2^{20}, 5)$, $(25 \cdot 2^{22}, 3)$, $(27 \cdot 2^{22}, 7)$, $(5 \cdot 2^{25}, 3)$,
$(7 \cdot 2^{26}, 3)$, $(27 \cdot 2^{26}, 13)$, $(15 \cdot 2^{27}, 31)$, $(17 \cdot 2^{27}, 3)$, $(3 \cdot 2^{30}, 5)$, $(13 \cdot 2^{28}, 3)$, $(29 \cdot 2^{27}, 3)$,
$(23 \cdot 2^{29}, 5)$. Some of the latter primes can be used with $\omega = 2^e$ for appropriate small $e$.
For a discussion of such primes, see R. M. Robinson, Proc. Amer. Math. Soc. 9 (1958),
673-681; S. W. Golomb, Math. Comp. 30 (1976), 657-663.
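
As a hedged illustration of the modular variant, the sketch below multiplies polynomials modulo the single prime $p = 7 \cdot 2^{26} + 1$ with primitive root 3, one of the pairs listed above. The function names and the default $t = 4$ are assumptions made for the example; recovering large coefficients would require several primes, as the text says, and that step is not shown.

```python
P = 7 * 2**26 + 1      # 469762049, taken from the list of primes above
R = 3                  # its least primitive root, as listed above

def ntt(a, omega):
    """Transform of a (length a power of 2) using omega as root of unity mod P."""
    n = len(a)
    if n == 1:
        return list(a)
    even = ntt(a[0::2], omega * omega % P)
    odd = ntt(a[1::2], omega * omega % P)
    out = [0] * n
    w = 1
    for s in range(n // 2):
        x = w * odd[s] % P
        out[s] = (even[s] + x) % P
        out[s + n // 2] = (even[s] - x) % P
        w = w * omega % P
    return out

def multiply_mod_p(u, v, t=4):
    """Coefficients of u(z)v(z) modulo P, using transforms of length 2^t."""
    N = 2**t
    size = len(u) + len(v) - 1
    assert size <= N and (P - 1) % N == 0
    omega = pow(R, (P - 1) // N, P)              # element of order N modulo P
    U = ntt(list(u) + [0] * (N - len(u)), omega)
    V = ntt(list(v) + [0] * (N - len(v)), omega)
    W = ntt([x * y % P for x, y in zip(U, V)], pow(omega, P - 2, P))
    n_inv = pow(N, P - 2, P)                     # 1/N modulo P, by Fermat's theorem
    return [w * n_inv % P for w in W[:size]]
```

The results are exact here because every coefficient of the true product is smaller than $p$.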

However, the method of exercise 59 will almost always be preferable in practice. 


58. (a) In general if $A$, $B$, $C$ realizes $(t_{ijk})$, then $(x_1, \ldots, x_m)A$, $B$, $C$ is a realization
of the $1 \times n \times s$ matrix whose entry in row $j$, column $k$ is $\sum_i x_it_{ijk}$. So there must be
at least as many nonzero elements in $(x_1, \ldots, x_m)A$ as the rank of this matrix. In the
case of the $m \times n \times (m + n - 1)$ tensor corresponding to polynomial multiplication
of degree $m - 1$ by degree $n - 1$, the corresponding matrix has rank $n$ whenever
$(x_1, \ldots, x_m) \ne (0, \ldots, 0)$. A similar statement holds with $A \leftrightarrow B$ and $m \leftrightarrow n$. [In
particular, if we work over the field of 2 elements, this says that the rows of $A$ modulo
2 form a "linear code" of $m$ vectors having distance at least $n$, whenever $A$, $B$, $C$ is
a realization consisting entirely of integers. This observation, due to R. W. Brockett
and D. Dobkin [Linear Algebra and its Applic. 19 (1978), 207-235, Theorem 14], can
be used to obtain nontrivial lower bounds on the rank over the integers. For example,
M. R. Brown and D. Dobkin have used it to show that realizations of $n \times n$ polynomial
multiplication over the integers must have $t \ge 3.52n$, for all sufficiently large $n$; see
IEEE Trans. C-29 (1980), 337-340.]


(b) The three matrices of the realization are as follows, with $\bar1$ denoting $-1$:

$$\begin{pmatrix}
1&0&0&0&0&1&1&1\\
0&1&0&0&1&1&0&1\\
0&0&1&1&0&0&1&1
\end{pmatrix},\quad
\begin{pmatrix}
1&0&0&0&0&1&1&1\\
0&1&0&0&0&1&0&1\\
0&0&1&0&0&0&1&1\\
0&0&0&1&1&0&0&1
\end{pmatrix},\quad
\begin{pmatrix}
1&0&0&0&0&0&0&0\\
\bar1&\bar1&0&0&0&1&0&0\\
\bar1&1&\bar1&0&0&0&1&0\\
1&0&0&\bar1&\bar1&\bar1&\bar1&1\\
0&0&1&0&1&0&0&0\\
0&0&0&1&0&0&0&0
\end{pmatrix}.$$

59. [IEEE Trans. ASSP-28 (1980), 205-215.] Note that cyclic convolution is polynomial
multiplication mod $u^n - 1$, and negacyclic convolution is polynomial multiplication
mod $u^n + 1$. Let us now change notation, replacing $n$ by $2^n$; we shall consider recursive
algorithms for cyclic and negacyclic convolution $(z_0, \ldots, z_{2^n-1})$ of $(x_0, \ldots, x_{2^n-1})$ with
$(y_0, \ldots, y_{2^n-1})$. The algorithms are presented in unoptimized form, for brevity and
ease in exposition; the reader who implements them will notice that many things can
be streamlined.

C1. [Test for simple case.] If $n = 1$, set $z_0 \leftarrow x_0y_0 + x_1y_1$, $z_1 \leftarrow (x_0 + x_1)(y_0 + y_1) - z_0$,
and terminate. Otherwise set $m \leftarrow 2^{n-1}$.

C2. [Remainderize.] For $0 \le k < m$, set $(x_k, x_{m+k}) \leftarrow (x_k + x_{m+k}, x_k - x_{m+k})$
and $(y_k, y_{m+k}) \leftarrow (y_k + y_{m+k}, y_k - y_{m+k})$. (Now we have $x(u) \bmod (u^m - 1) =
x_0 + \cdots + x_{m-1}u^{m-1}$ and $x(u) \bmod (u^m + 1) = x_m + \cdots + x_{2m-1}u^{m-1}$; we will
compute $x(u)y(u) \bmod (u^m - 1)$ and $x(u)y(u) \bmod (u^m + 1)$, then we will combine
the results by (57).)

C3. [Recurse.] Set $(z_0, \ldots, z_{m-1})$ to the cyclic convolution of $(x_0, \ldots, x_{m-1})$ with
$(y_0, \ldots, y_{m-1})$. Also set $(z_m, \ldots, z_{2m-1})$ to the negacyclic convolution of
$(x_m, \ldots, x_{2m-1})$ with $(y_m, \ldots, y_{2m-1})$.

C4. [Unremainderize.] For $0 \le k < m$, set $(z_k, z_{m+k}) \leftarrow \frac12(z_k + z_{m+k}, z_k - z_{m+k})$.
Now $(z_0, \ldots, z_{2m-1})$ is the desired answer. |
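
Steps C1 through C4 can be sketched in Python as follows. This is a minimal illustration, not the full method: the negacyclic half of step C3 is done by a direct quadratic-time routine (`negacyclic_direct`, a hypothetical stand-in for Algorithm N), and the lists are copied rather than updated in place.

```python
def negacyclic_direct(x, y):
    """Stand-in for Algorithm N: direct multiplication modulo u^m + 1."""
    m = len(x)
    z = [0] * m
    for i in range(m):
        for j in range(m):
            if i + j < m:
                z[i + j] += x[i] * y[j]
            else:
                z[i + j - m] -= x[i] * y[j]   # u^m is replaced by -1
    return z

def cyclic(x, y):
    """Cyclic convolution of integer lists whose common length is 2^n, n >= 1."""
    if len(x) == 2:                                        # step C1
        z0 = x[0] * y[0] + x[1] * y[1]
        return [z0, (x[0] + x[1]) * (y[0] + y[1]) - z0]
    m = len(x) // 2
    # Step C2: residues mod u^m - 1 (sums) and mod u^m + 1 (differences).
    xs = [x[k] + x[m + k] for k in range(m)]
    xd = [x[k] - x[m + k] for k in range(m)]
    ys = [y[k] + y[m + k] for k in range(m)]
    yd = [y[k] - y[m + k] for k in range(m)]
    c = cyclic(xs, ys)                                     # step C3
    d = negacyclic_direct(xd, yd)
    # Step C4: c[k] + d[k] and c[k] - d[k] are even, so the halving is exact.
    return ([(c[k] + d[k]) // 2 for k in range(m)] +
            [(c[k] - d[k]) // 2 for k in range(m)])
```

With integer inputs every halving in step C4 is exact, which is why integer division is safe here.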

N1. [Test for simple case.] If $n = 1$, set $t \leftarrow x_0(y_0 + y_1)$, $z_0 \leftarrow t - (x_0 + x_1)y_1$,
$z_1 \leftarrow t + (x_1 - x_0)y_0$, and terminate. Otherwise set $m \leftarrow 2^{\lfloor n/2\rfloor}$ and $r \leftarrow 2^{\lceil n/2\rceil}$. (The
following steps use $2^{n+1}$ auxiliary variables $X_{ij}$ for $0 \le i < 2m$ and $0 \le j < r$,
to represent $2m$ polynomials $X_i(w) = X_{i0} + X_{i1}w + \cdots + X_{i(r-1)}w^{r-1}$; similarly,
there are $2^{n+1}$ auxiliary variables $Y_{ij}$.)

N2. [Initialize auxiliary polynomials.] Set $X_{ij} \leftarrow X_{(i+m)j} \leftarrow x_{mj+i}$, $Y_{ij} \leftarrow Y_{(i+m)j} \leftarrow
y_{mj+i}$, for $0 \le i < m$ and $0 \le j < r$. (At this point we have $x(u) = X_0(u^m) +
uX_1(u^m) + \cdots + u^{m-1}X_{m-1}(u^m)$, and a similar formula holds for $y(u)$. Our
strategy will be to multiply these polynomials modulo $(u^{mr} + 1) = (u^{2^n} + 1)$, by
operating modulo $(w^r + 1)$ on the polynomials $X(w)$ and $Y(w)$, finding their cyclic
correlation of length $2m$ and thereby obtaining $x(u)y(u) \equiv Z_0(u^m) + uZ_1(u^m) +
\cdots + u^{2m-1}Z_{2m-1}(u^m)$.)

N3. [Transform.] (Now we will essentially do a fast Fourier transform on the polynomials
$(X_0, \ldots, X_{m-1}, 0, \ldots, 0)$ and $(Y_0, \ldots, Y_{m-1}, 0, \ldots, 0)$, using $w^{r/m}$ as a
$(2m)$th root of unity. This is efficient, because multiplication by a power of $w$
is not really a multiplication at all.) For $j = \lfloor n/2\rfloor - 1$, ..., 1, 0 (in this order),
do the following for all $m$ binary numbers $s + t = (s_{\lfloor n/2\rfloor} \ldots s_{j+1}0 \ldots 0)_2 +
(0 \ldots 0\,t_{j-1} \ldots t_0)_2$: Replace $(X_{s+t}(w), X_{s+t+2^j}(w))$ by the pair of polynomials
$(X_{s+t}(w) + w^{(r/m)s'}X_{s+t+2^j}(w),\ X_{s+t}(w) - w^{(r/m)s'}X_{s+t+2^j}(w))$, where
$s' = 2^j(s_{j+1} \ldots s_{\lfloor n/2\rfloor})_2$ is $2^j$ times the bit reversal of the leading bits of $s$. (See
Section 4.3.3 and Eq. 4.3.3-33. The operation $X_i(w) \leftarrow X_i(w) + w^kX_l(w)$ means,
more precisely, that we set $X_{ij} \leftarrow X_{ij} + X_{l(j-k)}$ if $j \ge k$, otherwise $X_{ij} \leftarrow
X_{ij} - X_{l(j-k+r)}$, for $0 \le j < r$, because $w^r = -1$.) Do the same transformation on the $Y$'s.

N4. [Recurse.] For $0 \le i < 2m$, set $(Z_{i0}, \ldots, Z_{i(r-1)})$ to the negacyclic convolution of
$(X_{i0}, \ldots, X_{i(r-1)})$ and $(Y_{i0}, \ldots, Y_{i(r-1)})$.

N5. [Untransform.] For $j = 0$, 1, ..., $\lfloor n/2\rfloor$ (in this order), set $(Z_{s+t}(w), Z_{s+t+2^j}(w)) \leftarrow
\frac12(Z_{s+t}(w) + Z_{s+t+2^j}(w),\ w^{-(r/m)s'}(Z_{s+t}(w) - Z_{s+t+2^j}(w)))$, for all $m$ choices
of $s$ and $t$ as in step N3, with $s' = 2^j(s_{j+1} \ldots s_{\lfloor n/2\rfloor})_2$.

N6. [Repack.] (Now we have accomplished the goal stated at the end of step N2, since
it is easy to show that the transform of the $Z$'s is the product of the transforms of
the $X$'s and the $Y$'s.) Set $z_i \leftarrow Z_{i0} - Z_{(m+i)(r-1)}$ and $z_{mj+i} \leftarrow Z_{ij} + Z_{(m+i)(j-1)}$
for $0 < j < r$, for $0 \le i < m$. |

It is easy to verify that at most $n$ extra bits of precision are needed for the
intermediate variables in this calculation; i.e., if $|x_i| \le M$ for $0 \le i < 2^n$ at the
beginning of the algorithm, then all of the $x$ and $X$ variables will be bounded by $2^nM$
throughout.

Algorithm N performs $A_n$ addition-subtractions, $D_n$ halvings, and $M_n$ multiplications,
where $A_1 = 5$, $D_1 = 0$, $M_1 = 3$; for $n > 1$ we have
$A_n = \lfloor n/2\rfloor 2^{n+2} + 2^{\lfloor n/2\rfloor+1}A_{\lceil n/2\rceil} + (\lfloor n/2\rfloor + 1)2^{n+1} + 2^n$,
$D_n = 2^{\lfloor n/2\rfloor+1}D_{\lceil n/2\rceil} + (\lfloor n/2\rfloor + 1)2^{n+1}$,
and $M_n = 2^{\lfloor n/2\rfloor+1}M_{\lceil n/2\rceil}$. The solutions are
$A_n = 11 \cdot 2^{n-1+\lceil\lg n\rceil} - 3 \cdot 2^n + 6 \cdot 2^nS_n$,
$D_n = 4 \cdot 2^{n-1+\lceil\lg n\rceil} - 2 \cdot 2^n + 2 \cdot 2^nS_n$, $M_n = 3 \cdot 2^{n-1+\lceil\lg n\rceil}$; here $S_n$ satisfies the
recurrence $S_1 = 0$, $S_n = 2S_{\lceil n/2\rceil} + \lfloor n/2\rfloor$, and it is not difficult to prove the inequalities
$\frac12 n\lceil\lg n\rceil \le S_n \le \frac12 n\lg n + n$. Algorithm C does approximately the same amount of
work as Algorithm N.
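
The stated solutions can be checked numerically against the recurrences. The sketch below assumes that the recursive calls are on size $\lceil n/2\rceil$, which is what makes the closed form for $M_n$ come out right; the names `S`, `counts`, and `closed_form` are illustrative.

```python
from functools import lru_cache

def ceil_lg(n):
    """Ceiling of lg n, for n >= 1."""
    return (n - 1).bit_length()

@lru_cache(maxsize=None)
def S(n):
    return 0 if n == 1 else 2 * S((n + 1) // 2) + n // 2

@lru_cache(maxsize=None)
def counts(n):
    """(A_n, D_n, M_n) computed directly from the recurrences."""
    if n == 1:
        return (5, 0, 3)
    lo, hi = n // 2, (n + 1) // 2
    a, d, m = counts(hi)
    k = 2 ** (lo + 1)
    A = lo * 2 ** (n + 2) + k * a + (lo + 1) * 2 ** (n + 1) + 2 ** n
    D = k * d + (lo + 1) * 2 ** (n + 1)
    return (A, D, k * m)

def closed_form(n):
    """(A_n, D_n, M_n) from the stated solutions."""
    p = 2 ** (n - 1 + ceil_lg(n))
    return (11 * p - 3 * 2 ** n + 6 * 2 ** n * S(n),
            4 * p - 2 * 2 ** n + 2 * 2 ** n * S(n),
            3 * p)
```

Comparing `counts(n)` with `closed_form(n)` over a range of `n` is a quick sanity check on a reconstruction of the formulas.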

It would be interesting to find a simpler way to carry out the additions and 
subtractions in step N3 (and the reverse operations in N5), perhaps analogous to Yates’s 
method. The operation $X_i \leftarrow X_i + w^kX_l$ sketched above can be done with a procedure
that generalizes the data-rotation algorithm of Fletcher and Silver in CACM 9 (1966), 
326, but there might be a better way. 

60. (a) In $\Sigma_1$, for example, we can group all terms having a common value of $j$ and $k$
into a single trilinear term; this gives $\nu^2$ trilinear terms when $(j, k) \in E \times E$, plus $\nu^2$
when $(j, k) \in E \times O$ and $\nu^2$ when $(j, k) \in O \times E$. When j = k we can also include
— x 33 VjJ z Jj i n Si, free of charge. [In the case n = 10, the method multiplies 10 
by 10 matrices with 710 noncommutative multiplications; this is fewer than required 
by any other known method, although Winograd’s scheme (35) uses only 600 when 
commutativity is allowed.] 

(b) Here we simply let $S$ be all the indices $(i, j, k)$ of one problem, $\tilde S$ the indices of
the other. [When $m = n = s = 10$, the result is quite surprising: We can multiply two
separate $10 \times 10$ matrices with 1300 noncommutative multiplications, while no scheme
is known that would multiply each of them with 650.]

(c) Corresponding to the left-hand side of the stated identity we get the terms 


x i+e,j- K Uj+C.k+i 7 z k+v,i+£~\- X 3-\-V,k-ir<L Uk- l-e.i+f z i+(,j+ V T - x k~\-^,i+ri Vi+r],j+e ^j-H.fc+c 


summed over (i,j, k) E S and 0 < e, r? < 1, so we get all the trilinear terms of the 
form XijyjkZki except when \i/ 2] = \j/2] — [fc/2]; however, these missing terms can 
all be included in Ei, E 2 , or E 3 . The sum Ei turns out to include terms of the form 
Xi+e,j- i-f Vi+v.j+e times some sum of z’s, so it contributes 8i/ 2 terms to the trilinear 
realization; and E 2 , E 3 are similar. To verify that the aBC terms cancel out, note 
that they are J^(— iy~ hv Xi-^ e ,jj rC yk+i,i+q Zj+ tl k+s, so rj — 1 cancels with rj = 0. 
[This technique leads to asymptotic improvements over Strassen's method whenever
$\frac13n^3 + 6n^2 - \frac43n < n^{\lg 7}$, namely when $36 \le n \le 184$, and it was the first construction
known to break the $\lg 7$ barrier. Reference: SIAM J. Computing 9 (1980), 321-342.]

61. (a) Replace $a_{il}(u)$ by $u\,a_{il}(u)$. (b) Let $a_{il}(u) = \sum_\mu a_{il\mu}u^\mu$, etc., in a polynomial
realization of length $r = \mathrm{rank}_d(t_{ijk})$. Then $t_{ijk} = \sum_{\mu+\nu+\sigma=d}\sum_{1\le l\le r} a_{il\mu}b_{jl\nu}c_{kl\sigma}$.
[This result can be improved to $\mathrm{rank}(t_{ijk}) \le (2d + 1)\,\mathrm{rank}_d(t_{ijk})$ in an infinite field, because
the trilinear form $\sum_{\mu+\nu+\sigma=d} a_\mu b_\nu c_\sigma$ corresponds to multiplication of polynomials
modulo $u^{d+1}$, as pointed out by Bini and Pan.] (c,d) This is clear from the realizations
in exercise 48.

62. The rank is 3, by the method of proof in Theorem W with ? = J). The border
rank cannot be 1, since we cannot have $a_1(u)b_1(u)c_1(u) \equiv a_1(u)b_2(u)c_2(u) \equiv u^d$ and
$a_1(u)b_2(u)c_1(u) \equiv a_1(u)b_1(u)c_2(u) \equiv 0$ (modulo $u^{d+1}$). The border rank is 2 because
of the realization (J J),(“?), (J ~i). 

63. (a) Let the elements of T(m, n, s ) and T(M, N, S) be denoted by fyzjoo.fc'XM') an( i 
T{i > j'){j i k , ){k,i , ) i respectively. Each element 7]of the direct product, 
where I — (i, I), J = (j, J ), and K = (k, K), is equal to t {l!j > ){jtk >) {k y)T {Ii j> ){ j iK > ){Ki i >> 
by definition, so it is 1 iff I' — I and J' = J and K' = K. 

(b) We have $M(mns) \le r^3$, since $T(mns, mns, mns) = T(m, n, s) \otimes T(n, s, m) \otimes
T(s, m, n)$. If $M(P) \le R$ we have $M(P^h) \le R^h$ for all $h$, and it follows that $M(N) \le
M(P^{\lceil\log_P N\rceil}) \le R^{\lceil\log_P N\rceil} \le RN^{\log R/\log P}$. [This result appears in Pan's 1972 paper.]

(c) We have $M_d(mns) \le r^3$ for some $d$, where $M_d(n) = \mathrm{rank}_d(T(n, n, n))$. If
$M_d(P) \le R$ we have $M_{hd}(P^h) \le R^h$ for all $h$, and the stated formula follows since
$M(P^h) \le \binom{hd+2}{2}R^h$ by exercise 61. In an infinite field we save a factor of $\log N$. [This
result is due to Bini and Schönhage, 1979.]

(d) Let P — mns ; we can perform p 3h independent P h X P h matrix multi¬ 
plications with at most ^ ld + 2 ^p 3h r 3h noncommutative scalar multiplications. Reduce 

M(P 2h ) to M(P h ) matrix multiplications of size P h X P h ; thus we have M(P 2h ) < 
(^ d ^~ 2 ^ r 3h M(P h ){l 0{p/r) 3h ). Iterating this recurrence gives 

M(N) = 0{N p(m ’ n ’ s ’ r) ) ■ exp(0(log log n) 2 ). 

64. (a) Let a — b — Vjki -B — Llj, c — uz k ij C — Zj k , so that 

E 2 — 0(w 2 ) can be eliminated. [When m = s = 7 and n = 1, this gives M(7V) = 
0(7V 2 - 66 ).] (b) Take Qi = s — 1, 02 — -af 2 ) 03 — = —1, 05 = 1, 06 = s — 1, 

and d > 4. [We assume that a -1 exists in the field.] (c) Taking the direct product of 
T(m, n, 2s) 0 T(2n, s, m ) 0 T(s, 2m, n ) with itself 3 h times gives a tensor whose border 
rank is at most (2(m 0 l)n(s 0 2)) 3 \ This tensor is the direct sum of 3 3h terms of the 
form T(m l (2ny s k , nV(2m) fc , (2s) l m 3 n k ) where i-\-j-\-k = 2>h, and (3 h)\/h\ 3 of these 
have i = j = k = h. Thus if we let P — (2 mns) h and p — (3 h)\/h\ 3 , the border rank 
of pT(P, P, P) is at most pr, where 

r = (2(m 0 l)n(s 0 2)) 3h /p. 

Exercise 63(d) now implies that M(N) — o(N (3 ^ p,p,p,r ^ +€ ) for all e > 0; here P and r 
are functions of h. We complete the proof by letting h be large in (3{P, P, P, r ) — 
logr/logP — (3/ilog2(m 0 l)n(s 0 2) — 3hlog3 0 O(log h))/(3h\og 2mns), which 
equals 0(m, n, 2s, §(m 0 l)n(s 0 2)) 0 0((log h)/h). [The best value is obtained for 
m = 5, n = 1, s = 11, 0 — 31og 110 52 < 2.522.] 


SECTION 4.7 

1. Find the first nonzero coefficient $V_m$, as in (4), and divide both $U(z)$ and $V(z)$
by $z^m$ (shifting the coefficients $m$ places to the left). The quotient will be a power
series iff $U_0 = \cdots = U_{m-1} = 0$.

2. We have $V_0^{n+1}W_n = V_0^nU_n - (V_0^1W_0)(V_0^{n-1}V_n) - (V_0^2W_1)(V_0^{n-2}V_{n-1}) - \cdots -
(V_0^nW_{n-1})(V_0^0V_1)$. Thus, we can start by replacing $(U_j, V_j)$ by $(V_0^jU_j, V_0^{j-1}V_j)$ for
$j \ge 1$, then set $W_n \leftarrow U_n - \sum_{0\le k<n} W_kV_{n-k}$ for $n \ge 0$, finally replace $W_j$ by
$W_j/V_0^{j+1}$ for $j \ge 0$. Similar techniques are possible in connection with other algorithms
in this section.
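
The scaling trick of answer 2 can be sketched as follows, assuming integer inputs with $V_0 \ne 0$ and at least $N$ coefficients in each list. The function name `divide_scaled` is hypothetical; it returns the integers $V_0^{n+1}W_n$, leaving the final division by $V_0^{n+1}$ to the caller.

```python
def divide_scaled(U, V, N):
    """Integers V_0^(n+1) * W_n for 0 <= n < N, where W(z) = U(z)/V(z).

    Only integer additions and multiplications are used.
    """
    v0 = V[0]
    # Replace (U_j, V_j) by (V_0^j U_j, V_0^(j-1) V_j) for j >= 1.
    Us = [U[j] * v0 ** j for j in range(N)]
    Vs = [None] + [V[j] * v0 ** (j - 1) for j in range(1, N)]
    Ws = []
    for n in range(N):
        Ws.append(Us[n] - sum(Ws[k] * Vs[n - k] for k in range(n)))
    return Ws
```

For instance, `divide_scaled([3, 1, 4], [2, 7, 1], 3)` yields numerators whose quotients by 2, 4, and 8 are the first three coefficients of $(3 + z + 4z^2)/(2 + 7z + z^2)$.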

3. Yes. When $\alpha = 0$, it is easy to prove by induction that $W_1 = W_2 = \cdots = 0$.
When $\alpha = 1$, we find $W_n = V_n$, by the "cute" identity

$$\sum_{k=1}^{n} \frac{k - (n - k)}{n}\,V_kV_{n-k} = V_0V_n.$$



4. If $W(z) = e^{V(z)}$, then $W'(z) = V'(z)W(z)$; we find $W_0 = 1$, and

$$W_n = \sum_{1\le k\le n} \frac{k}{n}\,V_kW_{n-k}, \qquad \text{for } n \ge 1.$$

If $W(z) = \ln(1 + V(z))$, then $W'(z) + W'(z)V(z) = V'(z)$; the rule is $W_0 = 0$, and
$W_1 + 2W_2z + \cdots = V'(z)/(1 + V(z))$.

[By exercise 6, the logarithm can be obtained to order $n$ in $O(n\log n)$ operations.
R. P. Brent observes that $\exp(V(z))$ can also be calculated with this asymptotic speed
by applying Newton's method to $f(x) = \ln x - V(z)$; therefore general exponentiation
$(1 + V(z))^\alpha = \exp(\alpha\ln(1 + V(z)))$ is $O(n\log n)$ too. Reference: Analytic Computational
Complexity, ed. by J. F. Traub (New York: Academic Press, 1975), 172-176.]
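
The two quadratic-time rules of answer 4 can be sketched with exact rational arithmetic. The sketch assumes $V_0 = 0$ and a coefficient list of length at least $N$; the names `series_exp` and `series_log1p` are illustrative, and the logarithm rule is written out coefficient by coefficient from $W'(z)(1 + V(z)) = V'(z)$.

```python
from fractions import Fraction

def series_exp(V, N):
    """First N coefficients of exp(V(z)), assuming V[0] == 0."""
    W = [Fraction(1)] + [Fraction(0)] * (N - 1)
    for n in range(1, N):
        W[n] = sum(Fraction(k, n) * V[k] * W[n - k] for k in range(1, n + 1))
    return W

def series_log1p(V, N):
    """First N coefficients of ln(1 + V(z)), assuming V[0] == 0."""
    W = [Fraction(0)] * N
    for n in range(1, N):
        # Coefficient of z^(n-1) in W'(1 + V) = V':
        #   n W_n + sum_{1 <= k < n} k W_k V_{n-k} = n V_n.
        W[n] = Fraction(V[n]) - sum(Fraction(k, n) * W[k] * V[n - k]
                                    for k in range(1, n))
    return W
```

Applying `series_log1p` to `series_exp(V, N)` with its constant term set to zero should return `V` again, which gives a simple consistency check.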

5. We get the original series back again. This can be used to test a reversion 
algorithm. 

6. $\varphi(x) = x + x(1 - xV(z))$, cf. Algorithm 4.3.3R. Thus after $W_0$, ..., $W_{N-1}$
are known, the idea is to input $V_N$, ..., $V_{2N-1}$, compute
$(W_0 + \cdots + W_{N-1}z^{N-1})(V_0 + \cdots + V_{2N-1}z^{2N-1}) = 1 + R_0z^N + \cdots + R_{N-1}z^{2N-1} + O(z^{2N})$,
and let $W_N + \cdots + W_{2N-1}z^{N-1} = -(W_0 + \cdots + W_{N-1}z^{N-1})(R_0 + \cdots + R_{N-1}z^{N-1}) + O(z^N)$.
[Numer. Math. 22 (1974), 341-348; this algorithm was, in essence, first published by M.
Sieveking, Computing 10 (1972), 153-156.] Note that the total time for $N$ coefficients
is $O(N\log N)$ if we use "fast" polynomial multiplication (exercise 4.6.4-57).
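
The doubling step of answer 6 can be sketched as below. This is an illustration only: the helper `mul_trunc` uses schoolbook multiplication, so the sketch does not attain the $O(N\log N)$ bound, which needs the fast multiplication mentioned above. Both function names are hypothetical.

```python
from fractions import Fraction

def mul_trunc(a, b, n):
    """Product of the series a and b, truncated to its first n coefficients."""
    out = [Fraction(0)] * n
    for i, ai in enumerate(a[:n]):
        for j, bj in enumerate(b[:n - i]):
            out[i + j] += ai * bj
    return out

def reciprocal(V, N):
    """First N coefficients of 1/V(z), doubling the known prefix each round."""
    W = [1 / Fraction(V[0])]
    while len(W) < N:
        n = len(W)
        prod = mul_trunc(W, V, 2 * n)     # equals 1 + R_0 z^n + ... + O(z^(2n))
        R = prod[n:2 * n]
        corr = mul_trunc(W, R, n)         # (W_0 + ...)(R_0 + ...) mod z^n
        W = W + [-c for c in corr]        # the next n coefficients of 1/V
    return W[:N]
```

For example, `reciprocal([1, -1], 6)` recovers the geometric series $1 + z + z^2 + \cdots$.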

7. $W_n = \binom{mk}{k}/n$ when $n = (m - 1)k + 1$, otherwise 0. (Cf. exercise 2.3.4.4-11.)

8. Input $G_1$ in step L1, and $G_n$ in step L2. In step L4, the output should be
$(U_{n-1}G_1 + 2U_{n-2}G_2 + \cdots + nU_0G_n)/n$. (The running time of the order $N^3$ algorithm
is hereby increased by only order $N^2$. The value $W_1 = G_1$ might be output in step L1.)

Note: Algorithms T and N determine $V^{-1}(U(z))$; the algorithm in this exercise
determines $G(V^{-1}(z))$, which is somewhat different. Of course, the results can all be
obtained by a sequence of operations of reversion and composition (exercise 11), but it
is helpful to have more direct algorithms for each case.

9.           n = 1   n = 2   n = 3   n = 4   n = 5
    T_{1n}     1       1       2       5      14
    T_{2n}             1       2       5      14
    T_{3n}                     1       3       9
    T_{4n}                             1       4
    T_{5n}                                     1

10. Form $y^{1/\alpha} = x(1 + a_1x + a_2x^2 + \cdots)^{1/\alpha} = x(1 + c_1x + c_2x^2 + \cdots)$ by means
of Eq. (9); then revert the latter series. (See the remarks following Eq. 1.2.11.3-11.)

11. Set $W_0 \leftarrow U_0$, and set $(T_k, W_k) \leftarrow (V_k, 0)$ for $1 \le k \le N$. Then for $n = 1$,
2, ..., $N$, do the following: Set $W_j \leftarrow W_j + U_nT_j$ for $n \le j \le N$; and then set
$T_j \leftarrow T_{j-1}V_1 + \cdots + T_nV_{j-n}$ for $j = N$, $N - 1$, ..., $n + 1$.

Here $T(z)$ represents $V(z)^n$. An on-line power series algorithm for this problem,
analogous to Algorithm T, could be constructed, but it would require about $N^2/2$
storage locations. There is also an on-line algorithm that solves this exercise and needs
only $O(N)$ storage locations: We may assume that $V_1 = 1$, if $U_k$ is replaced by $U_kV_1^k$
and $V_k$ is replaced by $V_k/V_1$ for all $k$. Then we may revert $V(z)$ by Algorithm L, using
its output as input to the algorithm of exercise 8 with $G_1 = U_1$, $G_2 = U_2$, etc., thus
computing $U((V^{-1})^{-1}(z)) - U_0$.
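
The first, cubic-time procedure of answer 11 translates almost line for line into Python. The sketch assumes $V_0 = 0$ and that both coefficient lists have $N + 1$ entries; the name `compose` is illustrative.

```python
def compose(U, V, N):
    """Coefficients W_0..W_N of U(V(z)), assuming V[0] == 0."""
    W = [U[0]] + [0] * N
    T = [0] + list(V[1:N + 1])            # T(z) starts out as V(z)^1
    for n in range(1, N + 1):
        for j in range(n, N + 1):
            W[j] += U[n] * T[j]           # add U_n * V(z)^n
        # Replace T by T * V, highest coefficient first so that the old
        # values of T are still available when they are needed.
        for j in range(N, n, -1):
            T[j] = sum(T[i] * V[j - i] for i in range(n, j))
    return W
```

A convenient check: with $U(x) = 1/(1 - x)$ and $V(z) = z/(1 + z)$ the composition is exactly $1 + z$.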

Brent and Kung have constructed several algorithms that are asymptotically faster.
For example, we can evaluate $U(x)$ for $x = V(z)$ by a slight variant of exercise 4.6.4-42(c),
doing about $2\sqrt n$ chain multiplications of cost $M(n)$ and about $n$ parameter
multiplications of cost $n$, where $M(n)$ is the number of operations needed to multiply
power series to order $n$; the total time is therefore $O(\sqrt n\,M(n) + n^2) = O(n^2)$. A
still faster method can be based on the identity $U(V_0(z) + z^mV_1(z)) = U(V_0(z)) +
z^mU'(V_0(z))V_1(z) + z^{2m}U''(V_0(z))V_1(z)^2/2! + \cdots$, extending to about $n/m$ terms, where
we choose $m \approx \sqrt{n/\log n}$; the first term $U(V_0(z))$ is evaluated in $O(mn(\log n)^2)$
operations using a method somewhat like that in exercise 4.6.4-43. Since we can go from
$U^{(k)}(V_0(z))$ to $U^{(k+1)}(V_0(z))$ in $O(n\log n)$ operations by differentiating and dividing
by $V_0'(z)$, the entire procedure takes $O(mn(\log n)^2 + (n/m)\,n\log n) = O(n\log n)^{3/2}$
operations. [JACM 25 (1978), 581-595.]

12. Polynomial division is trivial unless $m \ge n \ge 1$. Assuming the latter, the equation
$u(x) = q(x)v(x) + r(x)$ is equivalent to $U(z) = Q(z)V(z) + z^{m-n+1}R(z)$ where $U(x) =
x^mu(x^{-1})$, $V(x) = x^nv(x^{-1})$, $Q(x) = x^{m-n}q(x^{-1})$, and $R(x) = x^{n-1}r(x^{-1})$ are the
"reverse" polynomials of $u$, $v$, $q$, and $r$.

To find $q(x)$ and $r(x)$, compute the first $m - n + 1$ coefficients of the power series
$U(z)/V(z) = W(z) + O(z^{m-n+1})$; then compute the power series $U(z) - V(z)W(z)$,
which has the form $z^{m-n+1}T(z)$ where $T(z) = T_0 + T_1z + \cdots$. Note that $T_j = 0$ for
all $j \ge n$; hence $Q(z) = W(z)$ and $R(z) = T(z)$ satisfy the requirements.
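
The reduction of polynomial division to power series division can be sketched as follows. The sketch assumes coefficient lists ordered from the constant term upward and uses the simple quadratic recurrence for the series quotient; the name `poly_divmod` is hypothetical.

```python
from fractions import Fraction

def poly_divmod(u, v):
    """Quotient and remainder of u(x) by v(x); lists run from x^0 upward."""
    m, n = len(u) - 1, len(v) - 1
    assert m >= n >= 1 and v[n] != 0
    U = [Fraction(c) for c in reversed(u)]        # U(z) = z^m u(1/z)
    V = [Fraction(c) for c in reversed(v)]        # V(z) = z^n v(1/z)
    k = m - n + 1
    # First k coefficients W_0..W_{k-1} of the power series U(z)/V(z).
    W = []
    for i in range(k):
        s = sum(W[j] * V[i - j] for j in range(i) if i - j <= n)
        W.append((U[i] - s) / V[0])
    # U(z) - V(z)W(z) = z^k T(z), where T has degree less than n.
    T = []
    for i in range(k, m + 1):
        s = sum(V[i - j] * W[j] for j in range(k) if 0 <= i - j <= n)
        T.append(U[i] - s)
    q = list(reversed(W))                         # q(x) = x^(m-n) Q(1/x)
    r = list(reversed(T))                         # r(x) = x^(n-1) R(1/x)
    return q, r
```

For example, dividing $x^4 + 1$ by $x^2 + 1$ gives quotient $x^2 - 1$ and remainder 2.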

13. Apply exercise 4.6.1-3 with $u(z) = z^N$ and $v(z) = W_0 + \cdots + W_{N-1}z^{N-1}$;
the desired approximations are the values of $v_3(z)/v_2(z)$ obtained during the course of
the algorithm. Exercise 4.6.1-26 tells us that there are no further possibilities with
relatively prime numerator and denominator. If each $W_i$ is an integer, an all-integer
extension of Algorithm 4.6.1C will have the desired properties.

Notes: The case $N = 2n + 1$ and $\deg(w_1) = \deg(w_2) = n$ is of particular interest,
since it is equivalent to a so-called Toeplitz system. The method of this exercise
can be generalized to arbitrary rational interpolation of the form $W(z) \equiv p(z)/q(z)$
(modulo $(z - z_1)\ldots(z - z_N)$), where the $z_i$'s need not be distinct; thus, we can specify
the value of $W(z)$ and some of its derivatives at several points. See Fred G. Gustavson
and David Y. Y. Yun, IBM Research Report RC7551 (1979).

14. If $U(z) = z + U_kz^k + \cdots$ and $V(z) = z^k + V_{k+1}z^{k+1} + \cdots$, we find that
the difference $V(U(z)) - U'(z)V(z)$ is $\sum_{j\ge1} z^{2k+j-1}j\,(U_kV_{k+j} - U_{k+j} + (\text{polynomial
involving only } U_k, \ldots, U_{k+j-1}, V_{k+1}, \ldots, V_{k+j-1}))$; hence $V(z)$ is unique if $U(z)$ is
given and $U(z)$ is unique if $V(z)$ and $U_k$ are given.






The solution depends on two auxiliary algorithms, the first of which solves the
equation $V(z + z^kU(z)) = (1 + z^{k-1}W(z))V(z) + z^{k-1}S(z) + O(z^{k-1+n})$ for $V(z) =
V_0 + V_1z + \cdots + V_{n-1}z^{n-1}$, given $U(z)$, $W(z)$, $S(z)$, and $n$. If $n = 1$, let $V_0 =
-S(0)/W(0)$; or let $V_0 = 1$ when $S(0) = W(0) = 0$. To go from $n$ to $2n$, let

$$V(z + z^kU(z)) = (1 + z^{k-1}W(z))V(z) + z^{k-1}S(z) - z^{k-1+n}R(z) + O(z^{k-1+2n}),$$

$$1 + z^{k-1}\hat W(z) = (z/(z + z^kU(z)))^n\,(1 + z^{k-1}W(z)),$$

$$\hat S(z) = (z/(z + z^kU(z)))^n\,R(z),$$

and let $\hat V(z) = V_n + V_{n+1}z + \cdots + V_{2n-1}z^{n-1}$ satisfy

$$\hat V(z + z^kU(z)) = (1 + z^{k-1}\hat W(z))\hat V(z) + z^{k-1}\hat S(z) + O(z^{k-1+n}).$$

The second algorithm solves $W(z)U(z) + zU'(z) = V(z) + O(z^n)$ for $U(z) =
U_0 + U_1z + \cdots + U_{n-1}z^{n-1}$, given $V(z)$, $W(z)$, and $n$. If $n = 1$, let $U_0 = V(0)/W(0)$,
or let $U_0 = 1$ in case $V(0) = W(0) = 0$. To go from $n$ to $2n$, let $W(z)U(z) + zU'(z) =
V(z) - z^nR(z) + O(z^{2n})$, and let $\hat U(z) = U_n + \cdots + U_{2n-1}z^{n-1}$ be the solution to
the equation $(n + W(z))\hat U(z) + z\hat U'(z) = R(z) + O(z^n)$.

Resuming the notation of (27), the first algorithm can be used to solve $\hat V(U(z)) =
U'(z)(z/U(z))^k\hat V(z)$ to any desired accuracy, and we set $V(z) = z^k\hat V(z)$. To find
$P(z)$, suppose we have $V(P(z)) = P'(z)V(z) + O(z^{2k-1+n})$, an equation that holds
for $n = 1$ when $P(z) = z + \alpha z^k$ and $\alpha$ is arbitrary. We can go from $n$ to $2n$ by
letting $V(P(z)) = P'(z)V(z) + z^{2k-1+n}R(z) + O(z^{2k-1+2n})$ and replacing $P(z)$ by
$P(z) + z^{k+n}\hat P(z)$, where the second algorithm is used to find the polynomial $\hat P(z)$ such
that $(k + n - zV'(z)/V(z))\hat P(z) + z\hat P'(z) = (z^k/V(z))R(z) + O(z^n)$.

TABLES OF
NUMERICAL QUANTITIES


Table 1

QUANTITIES THAT ARE FREQUENTLY USED IN STANDARD SUBROUTINES
AND IN ANALYSIS OF COMPUTER PROGRAMS (40 DECIMAL PLACES)

           $\sqrt2$ = 1.41421 35623 73095 04880 16887 24209 69807 85697-
           $\sqrt3$ = 1.73205 08075 68877 29352 74463 41505 87236 69428+
           $\sqrt5$ = 2.23606 79774 99789 69640 91736 68731 27623 54406+
        $\sqrt{10}$ = 3.16227 76601 68379 33199 88935 44432 71853 37196-
        $\sqrt[3]2$ = 1.25992 10498 94873 16476 72106 07278 22835 05703-
        $\sqrt[3]3$ = 1.44224 95703 07408 38232 16383 10780 10958 83919-
        $\sqrt[4]2$ = 1.18920 71150 02721 06671 74999 70560 47591 52930-
            $\ln 2$ = 0.69314 71805 59945 30941 72321 21458 17656 80755+
            $\ln 3$ = 1.09861 22886 68109 69139 52452 36922 52570 46475-
           $\ln 10$ = 2.30258 50929 94045 68401 79914 54684 36420 76011+
          $1/\ln 2$ = 1.44269 50408 88963 40735 99246 81001 89213 74266+
         $1/\ln 10$ = 0.43429 44819 03251 82765 11289 18916 60508 22944-
              $\pi$ = 3.14159 26535 89793 23846 26433 83279 50288 41972-
 $1^\circ = \pi/180$ = 0.01745 32925 19943 29576 92369 07684 88612 71344+
            $1/\pi$ = 0.31830 98861 83790 67153 77675 26745 02872 40689+
            $\pi^2$ = 9.86960 44010 89358 61883 44909 99876 15113 53137-
 $\sqrt\pi = \Gamma(1/2)$ = 1.77245 38509 05516 02729 81674 83341 14518 27975+
      $\Gamma(1/3)$ = 2.67893 85347 07747 63365 56929 40974 67764 41287-
      $\Gamma(2/3)$ = 1.35411 79394 26400 41694 52880 28154 51378 55193+
                $e$ = 2.71828 18284 59045 23536 02874 71352 66249 77572+
              $1/e$ = 0.36787 94411 71442 32159 55237 70161 46086 74458+
              $e^2$ = 7.38905 60989 30650 22723 04274 60575 00781 31803+
           $\gamma$ = 0.57721 56649 01532 86060 65120 90082 40243 10422-
          $\ln \pi$ = 1.14472 98858 49400 17414 34273 51353 05871 16473-
             $\phi$ = 1.61803 39887 49894 84820 45868 34365 63811 77203+
         $e^\gamma$ = 1.78107 24179 90197 98523 65041 03107 17954 91696+
        $e^{\pi/4}$ = 2.19328 00507 38015 45655 97696 59278 73822 34616+
           $\sin 1$ = 0.84147 09848 07896 50665 25023 21630 29899 96226-
           $\cos 1$ = 0.54030 23058 68139 71740 09366 07442 97660 37323+
      $-\zeta'(2)$ = 0.93754 82543 15843 75370 25740 94567 86497 78979-
        $\zeta(3)$ = 1.20205 69031 59594 28539 97381 61511 44999 07650-
         $\ln \phi$ = 0.48121 18250 59603 44749 77589 13424 36842 31352-
       $1/\ln \phi$ = 2.07808 69212 35027 53760 13226 06117 79576 77422-
      $-\ln \ln 2$ = 0.36651 29205 81664 32701 24391 58232 66946 94543-

Table 2

QUANTITIES THAT ARE FREQUENTLY USED IN STANDARD SUBROUTINES 
AND IN ANALYSIS OF COMPUTER PROGRAMS (44 OCTAL PLACES) 

The names at the left of the equal signs are given in decimal notation. 

0.1 = 0.06314 63146 31463 14631 46314 63146 31463 14631 4632 
0.01 = 0.00507 53412 1 7270 24365 60507 53412 1 7270 24365 6051 
0.001 — 0.00040 61115 64570 65176 76355 44264 16254 02030 4467 
0.0001 = 0.00003 21556 13530 704U 54512 75170 33021 15002 3522 
0.00001 = 0.00000 24761 32610 70664 36041 06077 1 7401 56063 3442 
0.000001 = 0.00000 02061 57364 05536 66151 55323 07746 44470 2603 
0.0000001 = 0.00000 00153 27745 15274 53644 12741 72312 20354 0215 
0.00000001 = 0.00000 00012 57143 56106 04303 47374 77341 01512 6333 
0.000000001 = 0.00000 00001 04560 27640 46655 12262 71426 40124 2174 
0.0000000001 = 0.00000 00000 06676 33766 35367 55653 37265 34642 0163 
\[2 = 1.32404 74631 77167 46220 42627 66115 46725 12575 1744 
V3 = 1.56663 65641 30231 25163 54453 50265 60361 34073 4222 
= 2.17067 36334 57722 47602 57471 63003 00563 55620 3202 

VlQ = 3.12305 40726 64555 22444 02242 57101 41466 33775 2253 

V2 = 1.20505 05746 15345 05342 10756 65334 25574 22415 0303 
V$ = 1.34233 50444 22175 73134 07363 76133 05334 31147 6012 
V2 = 1.14067 74050 61556 12455 72152 64430 60271 02755 7314 
In 2 — 0.54271 02775 75071 73632 57117 07316 30007 71366 5364 
In 3 — 1.06237 24752 55006 05227 32440 63065 25012 35574 5534 
In 10 = 2.23273 06735 52524 25405 56512 66542 56026 46050 5071 
1/ln 2 = 1.34252 16624 53405 77027 35750 37766 40644 35175 0435 
1 /In 10 = 0.33626 75425 11562 41614 52325 33525 27655 14756 0622 
7r = 3.11037 55242 10264 30215 14230 63050 56006 70163 2112 
1° = tt/180 = 0.01073 72152 11224 72344 25603 54276 63351 22056 1155 
1/tt = 0.24276 30155 62344 20251 23760 4 7257 50765 15156 7007 
tt 2 = 11.6751714467 62135 71322 25561 15466 30021 40654 3410 
v/tt = r(l/2) = 1.61337 61106 64736 65247 47035 40510 15273 34470 1776 
T(l/3) = 2.53347 35234 51013 61316 73106 47644 54653 00106 6605 
T(2/3) = 1.26523 57112 14154 74312 54572 37655 60126 23231 0245 
e = 2.55760 52130 50535 51246 52773 42542 00471 72363 6166 
1/e = 0.27426 53066 13167 46761 52726 75436 02440 52371 0336 
e 2 = 7.30714 45615 23355 33460 63507 35040 32664 25356 5022 
7 = 0.44742 14770 67666 06172 23215 74376 01002 51313 2552 
In t r = 1.11206 40443 47503 36413 65374 52661 52410 37511 46O6 
(P = 1.47433 57156 27751 23701 27634 71401 40271 66710 1501 
e 1 = 1.61772 13452 61152 65761 22477 36553 53327 1 7554 2126 
e 7r/4 = 2.14275 31512 16162 52370 35530 11342 53525 44307 0217 
sin 1 = 0.65665 24436 O4414 73402 03067 23644 11612 07474 1451 


cos 1 = 0.42450 50037 32406 42711 07022 14666 27320 70675 1232 
—C'(2) = 0.74001 45144 53253 42362 42107 23350 50074 46100 2771 
f(3) = 1.14735 00023 60014 20470 15613 42561 31715 10177 0662 

in <f> = 0.36630 26256 61213 01145 13700 4 IOO 4 52264 30700 4065 

1/ln 0 = 2.04776 60111 17144 41512 11436 16575 00355 43630 4065 

— In In 2 = 0.27351 71233 67265 63650 1 7401 56637 26334 31455 5701 
Several of the values in Tables 1 and 2 were computed by John W. Wrench,
Jr. For high-precision values of constants not found in this list, see J. Peters, Ten
Place Logarithms of the Numbers from 1 to 100000, Appendix to Volume 1 (New
York: F. Ungar Publ. Co., 1957); and Handbook of Mathematical Functions, ed.
by M. Abramowitz and I. A. Stegun (Washington, D.C.: U.S. Gov’t Printing
Office, 1964), Chapter 1.
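The octal entries of Table 2 can be spot-checked by converting their leading digits back to a rational number. The following Python sketch is only an illustration; the helper name `octal_to_fraction` is an assumption of this example and does not come from the text.

```python
from fractions import Fraction
import math


def octal_to_fraction(text: str) -> Fraction:
    """Convert an octal numeral such as '3.11037 55242' to an exact Fraction.

    Blanks between digit groups are ignored; a digit 8 or 9 raises ValueError.
    """
    digits = text.replace(" ", "")
    whole, _, frac = digits.partition(".")
    value = Fraction(int(whole, 8)) if whole else Fraction(0)
    for position, digit in enumerate(frac, start=1):
        value += Fraction(int(digit, 8), 8 ** position)
    return value


# Leading twenty octal places of three table entries.
samples = {
    "pi": ("3.11037 55242 10264 30215", math.pi),
    "e": ("2.55760 52130 50535 51246", math.e),
    "ln 2": ("0.54271 02775 75071 73632", math.log(2)),
}

for name, (octal_text, reference) in samples.items():
    approx = float(octal_to_fraction(octal_text))
    print(f"{name}: {approx!r} (difference {abs(approx - reference):.1e})")
```

Twenty octal places carry sixty bits, so the differences printed here reflect only the limits of double precision, not of the table.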

A table of Bernoulli numbers through $B_{250}$ appears in a paper by D. E.
Knuth and T. J. Buckholtz, Math. Comp. 21 (1967), 663-688.


Table 3

VALUES OF HARMONIC NUMBERS, BERNOULLI NUMBERS,
AND FIBONACCI NUMBERS, FOR SMALL VALUES OF n

 n   H_n                      B_n               F_n     n

 0   0                        1                 0       0
 1   1                        -1/2              1       1
 2   3/2                      1/6               1       2
 3   11/6                     0                 2       3
 4   25/12                    -1/30             3       4
 5   137/60                   0                 5       5
 6   49/20                    1/42              8       6
 7   363/140                  0                 13      7
 8   761/280                  -1/30             21      8
 9   7129/2520                0                 34      9
10   7381/2520                5/66              55     10
11   83711/27720              0                 89     11
12   86021/27720              -691/2730         144    12
13   1145993/360360           0                 233    13
14   1171733/360360           7/6               377    14
15   1195757/360360           0                 610    15
16   2436559/720720           -3617/510         987    16
17   42142223/12252240        0                 1597   17
18   14274301/4084080         43867/798         2584   18
19   275295799/77597520       0                 4181   19
20   55835135/15519504        -174611/330       6765   20
21   18858053/5173168         0                 10946  21
22   19093197/5173168         854513/138        17711  22
23   444316699/118982864      0                 28657  23
24   1347822955/356948592     -236364091/2730   46368  24
25   34052522467/8923714800   0                 75025  25
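The rows of Table 3 can be regenerated with exact rational arithmetic. The sketch below is one possible way to do so in Python. It assumes the standard recurrence $\sum_{0 \le k \le m} \binom{m+1}{k} B_k = 0$ for $m \ge 1$ (with $B_1 = -1/2$, as in the table); the function names are illustrative only.

```python
from fractions import Fraction
from math import comb


def harmonic(n: int) -> Fraction:
    """H_n = 1 + 1/2 + ... + 1/n as an exact fraction (H_0 = 0)."""
    return sum((Fraction(1, k) for k in range(1, n + 1)), Fraction(0))


def bernoulli_numbers(limit: int) -> list:
    """Return [B_0, ..., B_limit], using sum_{k<=m} C(m+1, k) B_k = 0 for m >= 1."""
    b = [Fraction(1)]
    for m in range(1, limit + 1):
        partial = sum(comb(m + 1, k) * b[k] for k in range(m))
        b.append(-partial / (m + 1))
    return b


def fibonacci(n: int) -> int:
    """F_n with F_0 = 0 and F_1 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


bernoulli = bernoulli_numbers(25)
for n in range(26):
    print(n, harmonic(n), bernoulli[n], fibonacci(n))
```

`Fraction` reduces to lowest terms automatically, so the printed values can be compared with the table entries directly.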
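For any $x$, let $H_x = \sum_{n \ge 1} \left( \frac{1}{n} - \frac{1}{n+x} \right)$. Then

$$
\begin{aligned}
H_{1/2} &= 2 - 2\ln 2, \\
H_{1/3} &= 3 - \tfrac12 \pi/\sqrt3 - \tfrac32 \ln 3, \\
H_{2/3} &= \tfrac32 + \tfrac12 \pi/\sqrt3 - \tfrac32 \ln 3, \\
H_{1/4} &= 4 - \tfrac12 \pi - 3\ln 2, \\
H_{3/4} &= \tfrac43 + \tfrac12 \pi - 3\ln 2, \\
H_{1/5} &= 5 - \tfrac12 \pi\phi\sqrt{(2+\phi)/5} - \tfrac12 (3-\phi)\ln 5 - (\phi - \tfrac12)\ln(2+\phi), \\
H_{2/5} &= \tfrac52 - \tfrac12 \pi/\sqrt{3+4\phi} - \tfrac12 (2+\phi)\ln 5 + (\phi - \tfrac12)\ln(2+\phi), \\
H_{3/5} &= \tfrac53 + \tfrac12 \pi/\sqrt{3+4\phi} - \tfrac12 (2+\phi)\ln 5 + (\phi - \tfrac12)\ln(2+\phi), \\
H_{4/5} &= \tfrac54 + \tfrac12 \pi\phi\sqrt{(2+\phi)/5} - \tfrac12 (3-\phi)\ln 5 - (\phi - \tfrac12)\ln(2+\phi), \\
H_{1/6} &= 6 - \tfrac12 \pi\sqrt3 - 2\ln 2 - \tfrac32 \ln 3, \\
H_{5/6} &= \tfrac65 + \tfrac12 \pi\sqrt3 - 2\ln 2 - \tfrac32 \ln 3,
\end{aligned}
$$

and, in general, when $0 < p < q$ (cf. exercise 1.2.9-19),

$$
H_{p/q} = \frac{q}{p} - \frac{\pi}{2}\cot\frac{p}{q}\pi - \ln 2q + 2\sum_{1 \le n < q/2} \cos\frac{2pn}{q}\pi \cdot \ln\sin\frac{n}{q}\pi.
$$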
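The general formula for $H_{p/q}$ can be checked numerically against the special cases and against the defining series. The sketch below uses ordinary double-precision arithmetic; the function name `harmonic_fraction` is hypothetical, and the comparison tolerances are choices made for this example.

```python
import math


def harmonic_fraction(p: int, q: int) -> float:
    """Evaluate H_{p/q} for 0 < p < q from the closed formula:
    q/p - (pi/2) cot(p pi/q) - ln 2q + 2 sum_{1 <= n < q/2} cos(2 p n pi/q) ln sin(n pi/q).
    """
    if not 0 < p < q:
        raise ValueError("the formula requires 0 < p < q")
    total = q / p - (math.pi / 2) / math.tan(p * math.pi / q) - math.log(2 * q)
    n = 1
    while 2 * n < q:
        total += 2 * math.cos(2 * math.pi * p * n / q) * math.log(math.sin(math.pi * n / q))
        n += 1
    return total


def harmonic_series(x: float, terms: int = 200_000) -> float:
    """Partial sum of sum_{n >= 1} (1/n - 1/(n + x)); the tail is roughly x/terms."""
    return sum(1.0 / n - 1.0 / (n + x) for n in range(1, terms + 1))


phi = (1 + math.sqrt(5)) / 2
print(harmonic_fraction(1, 2), 2 - 2 * math.log(2))
print(harmonic_fraction(1, 5), harmonic_series(0.2))
```

The series converges slowly (its tail after N terms is about x/N), so it serves only as a coarse cross-check; the closed forms agree with the general formula to nearly full double precision.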
INDEX TO NOTATIONS


In the following formulas, letters that are not further qualified have the following
significance:

    j, k       integer-valued arithmetic expression
    m, n       nonnegative integer-valued arithmetic expression
    x, y, z    real-valued arithmetic expression
    f          real-valued function


Formal symbolism          Meaning                                                    Section reference

$\blacksquare$            end of algorithm, program, or proof                        1.1
$A_n$                     the nth element of linear array A
$A_{mn}$                  the element in row m and column n of
                          rectangular array A
$A[n]$                    equivalent to $A_n$                                        1.1
$A[m, n]$                 equivalent to $A_{mn}$                                     1.1
$V \leftarrow E$          give variable V the value of expression E                  1.1
$U \leftrightarrow V$     interchange the values of variables U and V                1.1
$(B \Rightarrow E;\ E')$  conditional expression: denotes E if B is
                          true, E' if B is false                                     8.1
$\delta_{kj}$             Kronecker delta: $(j = k \Rightarrow 1;\ 0)$               1.2.6
$\sum_{R(k)} f(k)$        sum of all f(k) such that the variable k is an
                          integer and relation R(k) is true                          1.2.3
$\prod_{R(k)} f(k)$       product of all f(k) such that the variable k
                          is an integer and relation R(k) is true                    1.2.3
$\min_{R(k)} f(k)$        minimum value of all f(k) such that the variable
                          k is an integer and relation R(k) is true                  1.2.3
$\max_{R(k)} f(k)$        maximum value of all f(k) such that the variable
                          k is an integer and relation R(k) is true                  1.2.3
$A^T$                     transpose of rectangular array A:
                          $A^T[j, k] = A[k, j]$                                      1.2.3
$x^y$                     x to the y power (when x is positive)                      1.2.2
$x^k$                     x to the kth power:
                          $(k \ge 0 \Rightarrow \prod_{0 \le j < k} x;\ 1/x^{-k})$   1.2.2
$x^{\overline{k}}$        x upper k:
                          $(k \ge 0 \Rightarrow \prod_{0 \le j < k} (x + j);\ 1/(x + k)^{\overline{-k}})$   1.2.6
$x^{\underline{k}}$       x lower k:
                          $(k \ge 0 \Rightarrow \prod_{0 \le j < k} (x - j);\ 1/(x - k)^{\underline{-k}})$  1.2.6
$n!$                      n factorial: $n^{\underline{n}}$                           1.2.5
$f'(x)$                   derivative of f at x                                       1.2.9
$f''(x)$                  second derivative of f at x                                1.2.10
$f^{(n)}(x)$              nth derivative: $(n = 0 \Rightarrow f(x);\ g'(x))$,
                          where $g(x) = f^{(n-1)}(x)$                                1.2.11.2
$H_n^{(x)}$               harmonic number of order x: $\sum_{1 \le k \le n} 1/k^x$   1.2.7
$H_n$                     harmonic number: $H_n^{(1)}$                               1.2.7
$F_n$                     Fibonacci number:
                          $(n \le 1 \Rightarrow n;\ F_{n-1} + F_{n-2})$              1.2.8
$B_n$                     Bernoulli number                                           1.2.11.2
$X \cdot Y$               dot product of vectors $X = (x_1, \ldots, x_n)$
                          and $Y = (y_1, \ldots, y_n)$: $x_1 y_1 + \cdots + x_n y_n$
$(\ldots a_1 a_0 . a_{-1} \ldots)_b$   radix-b positional notation: $\sum_k a_k b^k$  4.1
(min $x_1$, ave $x_2$,    a random variable having minimum
 max $x_3$, dev $x_4$)    value $x_1$, average (“expected”) value $x_2$,
                          maximum value $x_3$, standard deviation $x_4$              1.2.10
$\binom{x}{k}$            binomial coefficient: $(k < 0 \Rightarrow 0;\ x^{\underline{k}}/k!)$   1.2.6
$\binom{n}{n_1, n_2, \ldots, n_m}$   multinomial coefficient (defined only when
                          $n = n_1 + n_2 + \cdots + n_m$)                            1.2.6
$\left[{n \atop m}\right]$   Stirling number of the first kind:
                          $\sum_{0 < k_1 < k_2 < \cdots < k_{n-m} < n} k_1 k_2 \ldots k_{n-m}$   1.2.6
$\left\{{n \atop m}\right\}$   Stirling number of the second kind:
                          $\sum_{1 \le k_1 \le k_2 \le \cdots \le k_{n-m} \le m} k_1 k_2 \ldots k_{n-m}$   1.2.6
$/x_1, x_2, \ldots, x_n/$ continued fraction:
                          $1/(x_1 + 1/(x_2 + 1/(\cdots + 1/(x_n) \cdots)))$          4.5.3
$((x))$                   sawtooth function                                          3.3.3
$|x|$                     absolute value of x: $(x \ge 0 \Rightarrow x;\ -x)$
$\|S\|$                   cardinal: the number of elements in set S
$\lfloor x \rfloor$       floor of x, greatest integer function: $\max_{k \le x} k$  1.2.4
$\lceil x \rceil$         ceiling of x, least integer function: $\min_{k \ge x} k$   1.2.4
$\{a \mid R(a)\}$         set of all a such that the relation R(a) is true
$\{a_1, \ldots, a_n\}$    the set $\{a_k \mid 1 \le k \le n\}$
$\{x\}$                   fractional part (used in contexts where a
                          real value, not a set, is implied): $x - \lfloor x \rfloor$   3.3.3
$[y, z)$                  half-open interval: $\{x \mid y \le x < z\}$
$\langle X_n \rangle$     the infinite sequence $X_0, X_1, X_2, \ldots$
                          (here the letter n is part of the symbolism)               1.2.9
$\log_b x$                logarithm, base b, of x (real positive x
                          and b, where $b \ne 1$): the y such that $x = b^y$         1.2.2
$\lg x$                   binary logarithm: $\log_2 x$                               1.2.2
$\ln x$                   natural logarithm: $\log_e x$                              1.2.2
$\exp x$                  exponential of x: $e^x$                                    1.2.2
$x \bmod y$               mod function: $(y = 0 \Rightarrow x;\ x - y\lfloor x/y \rfloor)$   1.2.4
$u(x) \bmod v(x)$         remainder of polynomial u after division by
                          polynomial v                                               4.6.1
$x \equiv y$ (modulo z)   relation of congruence: $x \bmod z = y \bmod z$            1.2.4
$j \backslash k$          j divides k: $k \bmod j = 0$                               1.2.4
$B(x, y)$                 beta function                                              1.2.6
$\gamma$                  Euler’s constant: $\lim_{n \to \infty} (H_n - \ln n)$      1.2.7
$\gamma(x, y)$            incomplete gamma function                                  1.2.11.3
$\Gamma(x)$               gamma function: $\lim_{y \to \infty} \gamma(x, y)$         1.2.5
$\delta(x)$               characteristic function of the integers                    3.3.3
$e$                       base of natural logarithms: $\sum_{n \ge 0} 1/n!$          1.2.2
$\zeta(x)$                zeta function: $\lim_{n \to \infty} H_n^{(x)}$ (when x > 1)   1.2.7
$\ell(u)$                 leading coefficient of polynomial u                        4.6
$l(n)$                    length of shortest addition chain for n                    4.6.3
$\Lambda(n)$              von Mangoldt’s function                                    4.5.3
$\mu(n)$                  Möbius function                                            4.5.2
$O(f(n))$                 big-oh of f(n), as the variable $n \to \infty$             1.2.11.1
$O(f(x))$                 big-oh of f(x), for small values of the
                          variable x (or for x in some specified range)              1.2.11.1
$\varphi(n)$              Euler’s totient function:
                          $\sum_{0 \le k < n,\ \gcd(k, n) = 1} 1$                    1.2.4
$\pi$                     circle ratio: $4 \sum_{n \ge 0} (-1)^n/(2n + 1)$
$\phi$                    golden ratio: $\tfrac12 (1 + \sqrt5)$                      1.2.8
$\emptyset$               empty set: $\{x \mid 0 = 1\}$
$\infty$                  infinity: larger than any number
det(A)                    determinant of square matrix A                             1.2.3
gcd(j, k)                 greatest common divisor of j and k:
                          $(j = k = 0 \Rightarrow 0;\ \max_{d \backslash j,\ d \backslash k} d)$   4.5.2
lcm(j, k)                 least common multiple of j and k:
                          $(j = k = 0 \Rightarrow 0;\ \min_{d > 0,\ j \backslash d,\ k \backslash d} d)$   4.5.2
sign(x)                   sign of x: $(x = 0 \Rightarrow 0;\ x/|x|)$
Pr(S(n))                  probability that statement S(n) is true, for
                          “random” integers n (here the letter n is part
                          of the symbolism)                                          3.5, 4.2.4
mean(g)                   mean value of the probability distribution
                          represented by generating function g: $g'(1)$              1.2.10
var(g)                    variance of the probability distribution
                          represented by generating function g:
                          $g''(1) + g'(1) - g'(1)^2$                                 1.2.10
deg(u)                    degree of polynomial u                                     4.6
cont(u)                   content of polynomial u                                    4.6.1
pp(u(x))                  primitive part of polynomial u                             4.6.1
$\Re(w)$                  real part of complex number w
$\Im(w)$                  imaginary part of complex number w
$\overline{w}$            complex conjugate: $\Re(w) - \Im(w)\,i$
␣                         one blank space                                            1.3.1
rA                        register A (accumulator) of MIX                            1.3.1
rX                        register X (extension) of MIX                              1.3.1
rI1, ..., rI6             (index) registers I1, ..., I6 of MIX                       1.3.1
rJ                        (jump) register J of MIX                                   1.3.1
(L:R)                     partial field of MIX word, $0 \le L \le R \le 5$           1.3.1
OP ADDRESS,I(F)           notation for MIX instruction                               1.3.1, 1.3.2
u                         unit of time in MIX                                        1.3.1
*                         “self” in MIXAL                                            1.3.2
0F, 1F, 2F, ..., 9F       “forward” local symbol in MIXAL                            1.3.2
0B, 1B, 2B, ..., 9B       “backward” local symbol in MIXAL                           1.3.2
0H, 1H, 2H, ..., 9H       “here” local symbol in MIXAL                               1.3.2
$\oplus\ \ominus\ \otimes\ \oslash$   rounded or special operations                  4.2.1
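Several of the conditional definitions in this index translate directly into code. The Python sketch below illustrates three of them: the mod function with its $y = 0$ case, and the "upper" and "lower" factorial powers for negative exponents. The function names `mod`, `rising`, and `falling` are assumptions of this example, not notation from the text.

```python
from fractions import Fraction


def mod(x, y):
    """x mod y: (y = 0 => x; x - y*floor(x/y)).  Python's // is a floor division."""
    return x if y == 0 else x - y * (x // y)


def rising(x, k: int):
    """x upper k: product of (x + j) for 0 <= j < k when k >= 0,
    otherwise 1/(x + k) upper (-k)."""
    if k < 0:
        return Fraction(1) / rising(x + k, -k)
    result = 1
    for j in range(k):
        result *= x + j
    return result


def falling(x, k: int):
    """x lower k: product of (x - j) for 0 <= j < k when k >= 0,
    otherwise 1/(x - k) lower (-k)."""
    if k < 0:
        return Fraction(1) / falling(x - k, -k)
    result = 1
    for j in range(k):
        result *= x - j
    return result


print(mod(-5, 3), mod(5, -3), mod(5, 0))   # sign follows y; y = 0 returns x
print(rising(3, 2), rising(3, -1))         # 3*4 and 1/(3-1)
print(falling(5, 3), falling(3, -1))       # 5*4*3 and 1/(3+1)
```

With these conventions `n!` equals `falling(n, n)`, matching the table entry for the factorial.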
Seek and ye shall find.
—Matthew 7:7 


When an index entry refers to a page containing a relevant exercise, see also the answer to 
that exercise for further information; an answer page is not indexed here unless it refers to a 
topic not included in the statement of the exercise. 


A priori tests, 75. 

Abacus, 180, 184. 

Abramowitz, Milton, 41, 611. 

Absolute error, 293-294. 

Absorption laws, 636. 

ACC: Floating point accumulator, 202-203, 
232-233. 

Acceptance-rejection method, 120-123, 129, 
134-135, 553. 

Accuracy of floating point arithmetic, 206, 
213-230, 237, 311-312, 420, 466-467. 

Accuracy of random number generation, 26, 
90-92, 103, 171. 

Adaptation of coefficients, 471-479, 498-501.

Addition, 178, 191, 194, 197, 250-252, 
262-263, 265-268. 
complex, 468. 
continued fractions, 602. 
floating point, 199-204, 209, 211-216, 
219-230, 232-234, 237, 238-239, 249. 
fractions, 313-315. 
mixed-radix, 193, 266, 589. 
mod m, 11, 14-15, 171, 187, 271-272. 
modular, 269, 277. 

multiple-precision, 250-252, 262-263, 
265-268. 

polynomial, 399-401. 
power series, 506. 


Addition chains, 444-466, 501. 
ascending, 447. 
dual, 462, 466. 

$l^0$-, 459-460, 464.

star, 447, 453-457, 461, 463. 

Addition-subtraction chains, 465. 

Additive random number generation, 26-30, 
36-37, 171-173. 

Adleman, Leonard Max, 380, 386, 396. 

Admissible numbers, 165. 

Ahrens, Joachim Heinrich Lüdecke, 114,
124, 125, 128, 129, 131, 132, 133, 135, 
136, 553. 

Ahrens, Wilhelm Ernst Martin Georg, 192. 

al-Biruni, abu Rayhan Muhammad ibn 
Ahmad, 441. 

al-Kashi, Jemshid ibn Mes'ud, 182, 309, 443.

al-Khwarizmi, abu Ja‘far Muhammad ibn 
Musa, 181, 265. 

al-Uqlidisi, abu al-Hasan, Ahmad ibn 
Ibrahim, 182, 265, 441. 

Alanen, Jack David, 29. 

Alekseev, Boris Vasil’evich, 112. 

Algebra, free associative, 418-419. 

Algebraic dependence, 499. 

Algebraic integers, 380. 

Algebraic number field, 314, 316, 632-633. 
Algebraic system: A set of elements together with operations defined on
them, see Field, Ring, Unique factorization domain.

ALGOL, 265. 

Algorithms: Precise rules for transforming 
specified inputs into specified outputs 
in a finite number of steps.
analysis of, 7-8, 73-74, 135, 140, 238-249, 262-263, 266-267,
278-280, 285-289, 293-297, 300-301, 
311, 330-364, 367-369, 383-386, 
397-398, 417, 420, 427-429, 436-441, 
451, 481-482, 497, 503-505, 511, 
513-514, 654. 

complexity of, 133, 265, 299, 301, 
444-466, 475-479, 487-505. 
historical development of, 318-320, 441, 443.

proof of, 265, 266, 319-320. 

Alias method, 115-116, 122, 134, 555. 

Alt, Helmut, 647. 

AMM: American Mathematical Monthly, 
the official journal of the Mathematical 
Association of America, Inc.

Analysis of algorithms, 7-8, 73-74, 135, 140, 238-249, 262-263, 266-267,
278-280, 285-289, 293-297, 300-301, 
311, 330-364, 367-369, 383-386, 
397-398, 417, 420, 427-429, 436-441, 
451, 481-482, 497, 503-505, 511, 
513-514, 654. 

Analytical Engine, 185. 

Ananthanarayanan, Kasi, 123. 

AND (logical and), 305, 312, 373-374, 434, 617.

Anderson, Stanley F., 296. 

Anderson, Theodore Wilbur, 71. 

Antanairesis, 318-319, 362. 

Apollonius of Perga, 209. 

Apparition, rank of, 393. 

Approximate equality, 208, 217-219, 
228-229. 

Approximately linear density, 120-121. 

Approximation, by rational functions, 420, 
515. 

by rational numbers, 314-316, 363-364. 

Arabic mathematics, 181-182, 265, 309, 441, 443.

Arbitrary precision, 265, 268, 314. 

Archibald, Raymond Clare, 185. 

Arctangent, 297. 

Arithmetic, 178-515, see Addition, 
Comparison, Division, Doubling, 
Exponentiation, Greatest common 
divisor, Halving, Multiplication, 
Reciprocal, Square root, Subtraction.
complex, 189-190, 212, 268, 292-294,
467-468, 482, 487, 497, 501, 641-642, 647.

floating point, 198-249, 443. 
fractions, 68, 313-316, 409, 506-507. 
fundamental theorem of, 317, 364, 464. 
mod m, 11-15, 171-172, 187, 271-272, 
277, 443. 

modular, 268-278, 287-290, 431-441, 486.

multiple-precision, 250-301, 327-330, 339.

polynomial, 399-505. 

power series, 506-515. 

rational, 68, 313-316, 409, 506-507. 

Arithmetic chains, 646. 

Arrival time, 128. 

Ashenhurst, Robert Lovett, 225, 227, 310. 

Aspvall, Bengt Ingemar, vii. 

Associative law, 214, 217, 224-225, 227, 
229, 399, 636. 

Asymptotic values: Functions that express 
the limiting behavior approached by 
numerical quantities, 57, 248, 398, 
451-453, 506, 520.

Atanasoff, John Vincent, 186. 

Atrubin, Allan Joseph, 299. 

Aurifeuille, Léon François Antoine, 376.

Automata (plural of Automaton), 295, 
297-301, 311, 398. 

Automorphic numbers, 278. 

Avogadro, Amedeo, number, 198, 211, 223, 
225-226. 

Axioms for floating point arithmetic, 
214-218, 227-229. 

b -ary number, 144. 

b -ary sequence, 144-146, 164. 

Babbage, Charles, 185. 

Babenko, Konstantin Ivanovich, 350, 361. 

Babington-Smith, Bernard, 2-3, 72-73. 

Babylonian mathematics, 180, 209, 318. 

Bachet, Claude Gaspard, sieur de Méziriac,
192. 

Bachmann, Paul Gustav Heinrich, 605. 

Bag, 636. 

Baker, Kirby Alan, 300. 

Balanced decimal number system, 195. 

Balanced mixed-radix number system, 100, 586.

Balanced ternary number system, 190-193, 
211, 268, 336. 

Ballantine, John Perry, 263. 

Bareiss, Erwin Hans, 276, 416. 

Barton, David Elliott, 72. 

Base of representation, 179. 
floating point, 198, 210-211, 248. 

Bauer, Friedrich Ludwig, 226-227, 310. 

Baumgart, Bruce Guenther, 90. 

Bays, John Carter, 32. 

Beckenbach, Edwin Ford, 130. 
Becker, Oskar Joachim, 342.

Belaga, Eduard Grigor’evich, 477. 

Bell Telephone Laboratories Model V, 209. 

Bellman, Richard Ernest, ix. 

Benford, Frank, 240. 

Bentley, Jon Louis, 136. 

Berglund, Glenn David, 642. 

Bergman, George Mark, 620. 

Berlekamp, Elwyn Ralph, vi, 420, 423, 429, 
436, 625. 

Bernoulli, James (= Jakob = Jacques), 
184. 

numbers, 661. 
sequences, 165. 

Besicovitch, Abram Samoilovitch, 165. 

Beta distribution, 129-131. 

Beyer, William Aaron, 110. 

Bharati Krishna Tirtharji Maharaja, Jagadguru Swami Sri, shankaracharya
of Goverdhana Matha, 192.

Bienaymé, Irénée Jules, 72.

Bilinear forms, 487-496, 502-505. 

Billingsley, Patrick Paul, 611. 

Bin-packing problem, 550. 

Binary basis, 196. 

Binary-coded decimal, 186, 305, 311-312. 

Binary computer: A computer that manipulates numbers primarily in the
binary (radix 2) number system, 29-31, 186, 311, 322, 373-374, 617.

Binary-decimal conversion, 302-312. 

Binary digit, 179, 184. 

Binary gcd algorithms, 321-323, 330-339, 
417. 

Binary method for exponentiation, 441-444, 463-465.

Binary number systems, 179, 182-190, 
193-197, 400, 441, 464. 

Binary point, 179. 

Binary search, 307. 

Binary trees, 639. 

Binet, Jacques Phillipe Marie, 605. 

Bini, Dario, 482, 654-655. 

Binomial distribution, 131-133, 136, 385, 531.
continuous, 553. 
negative, 135. 
tail, 160. 

Binomial number system, 193. 

Binomial theorem, 507. 

Birnbaum, Zygmunt Wilhelm, 55. 

BIT: Nordisk Tidskrift for Informationsbehandling, a journal published by
DATA A/S, Copenhagen, Denmark.

Bit: “Binary digit,” either zero or unity, 179, 184.
random, 11, 29-31, 35-36, 45, 114-115, 133.

Bit manipulation, see Boolean operations. 


Björk, Johan Harry, 229.

Bluestein, Leo I., 588. 

Blum, Bruce Ivan, 265. 

Blum, Fred, 415, 499. 

Bofinger, Eve, 535. 

Bofinger, Victor John, 535. 

Bohlender, Gerd, 573. 

Boolean operations, 29-31, 177, 186, 305, 
311-312, 322, 373-374, 434, 439, 617, 
629, 637. 

Border rank, 505. 

Borel, Émile Félix Édouard Justin, 164.
Borodin, Allan Bertram, 479, 486, 496, 648. 
Borosh, Itzhak, 104, 113, 276, 548. 

Borrow, 252, 258, 266. 

Bouyer, Martine, 268. 

Bowden, Joseph, 185. 

Box, George Edward Pelham, 117. 

Boyer, Carl Benjamin, 182. 

Bradley, Gordon Hoover, 325, 362. 
Bramhall, Janet Natalie, 511. 

Brauer, Alfred Theodor, 451, 459, 464. 

Bray, Thomas Arthur, 123, 521. 

Brent, Richard Peirce, vii, 7, 27, 125, 131, 
134, 136, 226, 265, 297, 335, 338, 339, 
367, 371, 482, 510, 512-513, 515, 530, 
584, 597, 608, 656, 657. 

Bright, Herbert Samuel, 30. 

Brillhart, John David, vi, 28, 378, 380, 384. 
Brockett, Roger Ware, 652. 

Brooks, Frederick Phillips, 210. 

Brouwer, Luitzen Egbertus Jan, 166. 

Brown, D. J. Spencer, 637. 

Brown, George William, 130. 

Brown, Mark Robbin, 652. 

Brown, William Stanley, 401, 410, 420, 435, 629.

Brute force, 596. 

Buchholz, Werner, 186, 210. 

Buckholtz, Thomas Joel, 661. 

Bunch, James Raymond, 482. 

Buneman, Oscar, 647. 

Burks, Arthur Walter, 186. 

CACM: Communications of the ACM, 
a publication of the Association for 
Computing Machinery, Inc. 

Cahen, Eugène, 621.

Calculating prodigies, 279. 

Campbell, Sullivan Graham, 210. 
Cancellation error, 230. 
avoiding, 574. 

Cantor, David Geoffrey, 430-431. 

Cantor, Georg Ferdinand Louis Philippe, 
193. 

Capovani, Milvio, 482. 

Caramuel Lobkowitz, Juan, 183. 

Cards, playing, 139, 174. 

Carlitz, Leonard, 79, 86. 
Carmichael, Robert Daniel, 19.

numbers, 609, 613. 

Carr, John Weber, III, 210, 226, 227.

Carry, 189, 191, 232-234, 250-254, 258, 
261-263, 265-266, 400, 450, 456. 
Cassels, John William Scott, 105, 151. 
Casting out nines, 273, 287, 307. 

Catalan, Eugène Charles, numbers, 639.
Cauchy, Augustin Louis, 192, 314, 506.
inequality: $(\sum a_k b_k)^2 \le (\sum a_k^2)(\sum b_k^2)$, 95, 216.

CDC 1604, 276. 

Ceiling function, 77, 665. 

Cesàro, Ernesto, 337.

Chain step, 475, 500-503. 

Chaitin, Gregory John, 164, 166. 

Chan, Tony Fan-Cheong [陳繁昌], 572.
Chapple, M. A., 511.

CHAR (convert to characters), 311. 
Characteristic, 199, see Exponent part. 
Characteristic polynomial, 480. 

Charles XII of Sweden, 184. 

Chartres, Bruce Aylwin, 227. 

Chebotarev, Nikolai Grigor’evich, 632. 
Cheng, Russell Ch’uan Hsun, 130.
Chhin Chiu Shao [秦九韶], 271.

Chi-square distribution, 41, 45, 47, 65, 130. 
table, 41. 

Chi-square test, 39-45, 50-54, 56-59. 
Chinese mathematics, 181-182, 271. 

Chinese remainder algorithm, 20, 274, 277, 
289, 435-436, 486. 

Chinese remainder theorem, 269-271, 373, 
629. 

generalized, 276. 

for polynomials, 437 (exercise 3), 486, 
490-492. 

Chirp transform, 588 (exercise 8). 

Choice, random, 2, 114-116, 122, 134. 
Christiansen, Hanne Delgas, 72. 

Church, Alonzo, 165. 

Classical algorithms, 250-268. 

Cochran, William Gemmell, 53. 

Cocke, John, 212. 

Coefficients of a polynomial, 399. 
adaptation of, 471-479, 498-501. 
size of, 401, 414, 432-433, 438-439. 
Cohen, Daniel Isaac Aryeh, 579. 

Cohn, Paul Moritz, 418, 620. 

Colenne, Joseph Désiré, 185.

Collins, George Edwin, vi, 264, 265, 357, 
401, 410, 434, 435, 441, 595, 622. 
Collision test, 68-70, 72-73, 151. 

Colson, John, 192. 

Colton, Rev. Charles Caleb, vii. 
Combination, random, 136-141. 
Combination of random number generators, 
31-33, 36-37. 

Combinations with repetitions, 614. 


Combinatorial matrix, 112. 

Common divisors, 419, see also Greatest 
common divisor. 

Commutative law, 214, 316, 636, 639. 

Commutative ring with identity, 399, 401, 407.

Comp. J.: The Computer Journal, published by the British Computer
Society.

Companion matrix, 494. 

Comparison: Testing for <, =, or >. 
continued fractions, 606. 
floating point, 208, 217-219, 224, 227-229.
fractions, 315. 
mixed-radix, 274-275. 
modular, 274-275. 
multiple-precision, 266. 

Complement notations for numbers, 14-15, 186-187, 194, 197, 212, 261,
262, 264, 272.

Complete binary tree, 555. 

Completely equidistributed sequence, 164. 

Complex arithmetic, 189-190, 212, 268, 
292-294, 467-468, 482, 487, 497, 501, 
641-642, 647. 

Complex number representation, 189-190, 
193-194, 196, 268, 401. 

Complexity of calculation, 133, 265, 299, 301, 444-466, 475-479, 487-505.

Composition of power series, 514, 656. 

Computability, 154-156, 164-166, 169. 

Computational complexity, 133, 265, 299, 301, 444-466, 475-479, 487-505.

Congruential sequence, linear, 9-11.
choice of increment, 10, 15, 21, 84-85, 93, 171.

choice of modulus, 11-15, 170. 
choice of multiplier, 10, 15-25, 84-86, 
98-105, 170-171. 
choice of seed, 15, 19, 137, 170. 
period length, 15-22. 
subsequence of, 10, 71. 

Congruential sequence, quadratic, 25-26, 34.

Conjugate of complex number, 642, 667. 

Content of a polynomial, 405-406. 

Context-free grammar, 636. 

Continuant polynomials, 340-343, 358, 
361-363, 420, 548, 604, 605, 607, 621. 

Continued fractions, 339-340, 479, 500. 
infinite, 341, 358. 
quadratic irrationality, 342, 359, 380-382, 398.

regular, 330, 341-342, 352-353, 358-363. 
with polynomial quotients, 420, 479, 500. 

Continuous binomial distribution, 553. 

Continuous distribution function, 47, 51, 55, 58, 116-117, 576.
Continuous Poisson distribution, 552.
Convergents, 363, 380, 420.

Conversion of representations, 205, 208-209, 212, 237, 273-274, 277,
287-289, 301, see also Radix conversion.

Convolution, 290, 385, 535, 551. 
cyclic, 290-294, 300, 491-494, 502-503. 
multidimensional, 651. 
negacyclic, 503. 

Conway, John Horton, 385. 

Cook, Stephen Arthur, vi, 195, 280, 296, 
301, 617, 648. 

multiplication algorithm, 280-285, 617. 
Cooley, James William, 642. 

Coolidge, Julian Lowell, 467. 

Coonen, Jerome, 210. 

Copeland, Arthur Herbert, 165. 
Coppersmith, Don, 168, 169, 482. 

Coroutine, 360, 610. 

Correlation coefficient, 70-75, 78-85, 127. 
Cosine, 231, 471. 

Couffignal, Louis, 186. 

Counting law, 636. 

Coupon collector’s test, 61-63, 74, 151, 167. 
Covariance, 134. 

Coveyou, Robert Reginald, 26, 34, 84, 88, 
110, 527. 

Cox, Albert George, 263. 

Craps, 174. 

CRAY-I, 391. 

Cryptanalysis, 177, 386-389, 397, 486. 
Cusick, Thomas William, 548. 

Cycle in a sequence, 7-9, 21, 34-36. 
detection of, 4, 7-8. 

Cyclic convolution, 290-294, 300, 491-494, 
502-503. 

Cyclotomic polynomials, 378, 432-433, 440, 
492, 496. 

Dahl, Ole-Johan, 141. 

Darling, Donald Allan, 56. 

Datta, Bibhutibhusan, 441. 

Davenport, Harold, 359. 

Davis, Chandler, 564. 

Davis, Clive Selwyn, 603. 
de Bruijn, Nicolaas Govert, 196, 605, 614, 
629, 636. 
cycle, 35-36. 

de Groote, Hans Friedrich, 648. 
de Jong, Lieuwe Sytse, 497. 
de Jonquières, Adm. Jean Philippe Ernest
de Fauque, 445, 449, 458.
de la Vallée Poussin, Charles Louis Xavier
Joseph, 366.

Debugging, 205-207, 260-261, 314, 656. 
DEC 20, 14. 


Decimal computer: A computer that manipulates numbers primarily in the
decimal (radix ten) number system, 186.

Decimal digits, 179, 302. 

Decimal fractions, 181-182. 

Decimal number system, 181-183, 194-195, 359.

Decimal point, 179, 182. 

Decision, unbiased, 2, 114-116, 122, 134. 
Decuple-precision floating point, 268. 
Dedekind, Richard, 78. 
sum, 78-87, 104. 

Definitely greater than, 208, 218, 228. 
Definitely less than, 208, 218, 228. 

Definition of randomness, 2, 142-169. 
Degree of a polynomial, 399, 401, 418. 
Degrees of freedom, 41-42, 476-477, 
499-500. 

Dekker, Theodorus Jozef, 227, 229, 237. 
Dellac, H., 445. 

Density function, 119-120, 134. 

Dependence, 127, 134, see Independence of
random numbers.
algebraic, 499. 

linear, 381, 423, 425-427, 610. 

Derivative, 421, 470, 507, 631. 

Descartes, René, 391.

Determinant, 338, 358, 415, 416, 479-480, 
482, 496. 

Deviate: A random number. 

Dewey, Melvil, notation for trees, 530. 
Diaconis, Persi Warren, 248, 249, 578. 
Diamond, Harold George, 230. 

Dice, 2, 6, 39-42, 56, 115-116, 174. 
Dickman, Karl, 367. 

Dickson, Leonard Eugene, 271, 371, 376, 598.

Dieter, Ulrich Otto, vii, 85, 87, 98, 110, 114, 124-125, 129, 132, 133, 553.
Differences, 281-282, 484-487, 498. 
Differentiation, see Derivative. 

Diffie, Bailey Whitfield, 388. 

Digit: One of the symbols used in positional 
notation; usually a decimal digit, one 
of the symbols 0, 1, ..., or 9. 
binary, 179, 184. 
decimal, 179, 302. 
hexadecimal, 179, 185, 194. 
octal, 185, 194. 

Dilogarithm, 578. 

Diophantine equations, 326-327, 337, 359. 
Direct product, 502, 504, 505. 

Direct sum, 502, 504, 505. 

Directed graph, 460-462, 466. 

Dirichlet, Peter Gustav Lejeune-, 637. 
Discrepancy, 37, 105-110, 113. 

Discrete distribution function, 45, 115-116, 
131-141. 
Discrete Fourier transform, 290-294, 300,
482-484, 487, 494, 497, 502-503. 

Discriminant of a polynomial, 619, 628, 632.

Distinct-degree factorization, 429-431, 439, 632.

Distribution: A specification of probabilities 
that govern the value of a random 
variable, 2, 114, 116. 
beta, 129-131. 

binomial, 131-133, 136, 160, 385, 531, 553.

chi-square, 41, 45, 47, 65, 130. 
discrete, 45, 115-116, 131-141. 
exponential, 114, 128-133, 554. 

F-, 130. 

of floating point numbers, 238-249. 
gamma, 129-130, 135. 
geometric, 131, 132, 135, 535 (exercise 3), 
549, 551. 

integer-valued, 131-135. 

Kolmogorov-Smirnov, 48-49, 55-56, 58. 
of leading digits, 239-249. 
mixture, 118-119, 133-134. 
negative binomial, 135. 
normal, 54, 71, 117-127, 129, 130, 
134-135, 368. 

partial quotients of continued fraction, 
345-353, 615-616. 

Poisson, 53, 132-133, 135-136, 517. 
of prime factors, 367-369, 395. 
of prime numbers, 366-367, 396, 616, 
632-633. 

Student’s, 130. 
t-, 130.

tail of binomial, 160. 
tail of normal, 122-123, 134. 
uniform, 2, 9, 45, 47, 55, 114, 116-120, 
133, 248. 

variance-ratio, 130. 
wedge-shaped, 120-121. 

Distribution functions, 45-47, 51, 116-117, 
135, 241-242, 247, 345-346. 
continuous, 47, 51, 55, 58, 116-117, 576. 
discrete, 45, 115-116, 131-141. 
empirical, 47-50. 
mixture of, 118-119, 133-134. 
product of, 116-117. 

Distributive laws, 215-216, 229, 317, 399, 636.

Divide-and-correct, 255-260, 263-268. 

Divided differences, 485, 498. 

Dividend: The quantity u while 
computing u/v. 

Division, 178, 250-251, 255-260, 263-268, 
295-297. 

complex, 212, 268, 647. 
continued fractions, 602. 
double-precision, 235-237. 


floating point, 204-205, 208, 212, 215,
224, 226, 228-230, 235-237, 248, 577.
fractions, 313, 315.
long, 255-260, 263-268.
mixed-radix, 193, 589.
mod m, 25 (exercise 7), 277, 337, 427-428, 480.
modular, 277. 

multiple-precision, 255-261, 263-268, 
295-297. 

polynomial, 401-420, 468-469, 515. 
power series, 506-507, 514-515. 
pseudo-, 407-409, 416, 418. 
string polynomials, 418. 
synthetic, 402. 

Divisor: The quantity v while computing u/v; or, we say x is a
divisor of y if y mod x = 0; it is a proper divisor if it is a
divisor such that 1 < x < y.
polynomial, 403. 

Dixon, John Douglas, 356, 385, 395, 397. 
Dixon, Wilfrid Joseph, 71. 

Dobell, Alan Rodney, 16. 

Dobkin, David Paul, 638, 652. 

Donsker, Monroe David, 532. 

Doob, Joseph Leo, 532. 

Dorn, William Schroeder, 469. 
Double-precision arithmetic, 230-237, 
263-264, 278-279. 

Doubling, 305, 360, 443. 

Doubling step, 447. 

Downey, Peter James, 466. 

Dragon curve, 564, 566, 607. 

Dresden, Arnold, 180. 

Drift, 221-222, 229-230. 

Dual of an addition chain, 462, 466, 639. 
Duncan, Robert Lee, 249. 

Duodecimal number system, 183. 

Dupré, Athanase, 605.

Durbin, James, 54. 

Durham, Stephen Daniel, 32. 

Durstenfeld, Richard, 140. 

e, 11, 73, 342, 360, 659-660, 666. 

Earle, John Goodell, 296. 

Easton, Malcolm Coleman, 555. 

EDVAC, 210. 

Effective algorithms, 154-156, 164-166, 169. 
Egyptian mathematics, 318, 443. 

Eisenstein, Ferdinand Gotthold, 438. 
Electrologica X8, 206. 

Ellipse, random point on, 130-131, 136. 
Ellipse, volume of, 101. 

Empirical distribution function, 47-50. 
Empirical tests for randomness, 59-75. 
Encoding a permutation, 64, 75, 139. 
Encoding secret messages, 177, 386-389, 
397, 486. 





Engineering Research Associates, 192. 
ENIAC, 52. 

Enison, Richard Lawrence, 30. 

Enumeration of tree structures, 639. 
Equality, approximate, 208, 217-219, 
228-229. 

Equidistributed sequence, 143-145, 157, 
166-169. 

Equidistribution test, 59, 72. 

Equivalent addition chains, 461, 466. 
Eratosthenes, sieve of, 394. 

Erdős, Pál (= Paul), 369, 451, 638.

ERH, see GRH. 

ERNIE, 3. 

Error, relative, 206, 213, 216-217, 237, 240. 
Error estimation, 206, 213, 216-217, 237, 
240, 293-294. 

Essential equality, 218-219, 228-229. 

Estrin, Gerald, 469. 

Euclides (= Euclid), 318-320. 

Euclid’s algorithm, 81-83, 113, 272, 289, 
317-320, 323-324, 338-339, 544. 
analysis of, 339-364, 605. 
extended, 325, 337, 417, 515. 
for polynomials, 405-420, 515. 
for string polynomials, 419. 
multiple-precision, 327-330. 

Eudoxus of Cnidus, 318, 342. 

Euler, Leonhard, 340, 360, 361, 391, 602. 
constant $\gamma$, 342, 360, 611, 629, 659-660, 666.

theorem, 19, 270, 273, 523.
totient function $\varphi(n)$, 19, 273, 353-354,
361, 548, 666.

Evaluation: Computing the value.
of determinants, 416, 479-480, 482.
of mean and standard deviation, 216, 229.

of monomials, 465-466.

of polynomials, 466-505, 588 (exercise 8).

of powers, 441-466. 

Eve, James, 474, 499. 

Eventually periodic sequence, 7-8, 21, 359, 
369-371. 

Excess q exponent, 198-199, 211, 231. 
Exclusive or, 29-31, 177, 400. 

Exercises, notes on, ix-xi. 

Exhaustive search, 99-100. 

Exponent overflow, 201, 203, 206, 211, 216, 
227-228, 233. 

Exponent part of a floating point number, 
198-199, 231, 248, 268. 

Exponent underflow, 201, 203, 206, 211, 
216, 227-228, 233. 

Exponential deviate, generating, 128- 
Exponential distribution, 114, 128-133, 554.

Exponential function, 297, 471, 514. 


Exponential sums, 79-81, 105-109, 113, 168, 366.

Exponentiation: Raising to a power, 
441-466, 507, 656. 

Extended arithmetic, 230, 593. 

Extended Euclidean algorithm, 325, 337, 
417, 515. 

F-distribution, 130. 

Factor method of exponentiation, 443, 445, 
462-463, 466. 

Factorial number system, 64, 192. 

Factorial power, 281-282, 497, 597, 664. 
Factorization: Discovering factors.
of integers, 12-13, 317, 353, 364-398, 464.
of polynomials, 420-441. 
uniqueness of, 403-404, 417. 

FADD (floating add), 208, 209, 211, 498. 

Fan, Chung Teh, 137.

Farmwald, Paul Michael, 190. 

Fast Fourier transform, 71, 290-294, 300, 
483-484, 486, 494, 497, 651, 653. 
Fateman, Richard J., 443.

FCMP (floating compare), 208, 229. 

FDIV (floating divide), 208. 

Fermat, Pierre de, 371-372, 375, 391, 544. 
numbers, 13, 371, 375, 380. 
theorem, 375, 394, 421. 

FFT, see Fast Fourier transform. 

Fibonacci, Leonardo, of Pisa, 181, 192, 265. 
number system, 193. 
numbers: elements of the Fibonacci 
sequence, 664. 
numbers, table, 661. 
sequence, 26, 28, 33, 34, 44, 50, 52, 88,
343, 172, 448, 464, 568, 611, 616. 

Field: An algebraic system admitting 
addition, subtraction, multiplication, 
and division, 197, 314, 401-403, 

487, 506. 

finite, 28, 438, 529, 630, 643. 

Fike, Charles Theodore, 472. 

Finite Fourier transform, see Discrete 
Fourier transform. 

Finite sequence, random, 145, 161-164. 
Fischer, Michael John, vii, 301. 

Fischer, Patrick Carl, 226. 

FIX, 208. 

Fix-to-float conversion, 205, 208. 

Fixed point arithmetic, 193, 198, 292-294. 
Fixed slash, 314-315, 363-364. 

Flat distribution, see Uniform distribution. 
Flehinger, Betty Jeanne, 247. 

Fletcher, William, 654. 

Float-to-fix conversion, 208, 209, 212. 
Floating binary numbers, 198, 210-212, 
238-239, 248. 
Floating decimal numbers, 198, 210-211,
238-239. 

Floating hexadecimal numbers, 238-239, 
248. 

Floating point arithmetic, 33, 172, 180, 
198-249, 276, 314, 530. 
accuracy of, 206, 213-230, 237, 311-312, 
420, 466-467. 

addition, 199-204, 209, 211-216, 219-230, 
232-234, 237, 238-239, 249. 
axioms, 214-218, 227-229. 
comparison, 208, 217-219, 224, 227-229. 
decuple-precision, 268. 
division, 204-205, 208, 212, 215, 224, 226, 
228-230, 235-237, 248, 577. 
double-precision, 230-237, 263-264. 
interval, 212, 225-227, 230, 570. 
mod, 212, 228. 

multiplication, 204, 207, 208, 215-216, 
224, 226-230, 234-235, 237, 248-249. 
operators of MIX, 208, 211, 498. 
reciprocal, 228, 248. 
remainder, 212, 228. 
single-precision, 198-213. 
subtraction, 200-204, 214-216, 219-225, 

230, 232-234, 238-239, 249. 
summation, 216, 229. 
triple-precision, 237. 
unnormalized, 223-225, 227, 229, 310. 

Floating point numbers, 180, 198-199, 206, 
223, 225, 231. 
radix b, excess q, 198-199. 
statistical distribution, 238-249. 

Floating point radix conversion, 309-312. 

Floating point trigonometric subroutines, 

231, 471. 

Floating slash, 314-316, 363. 

Floor function, 77, 665. 

FLOT (float), 208.

Floyd, Robert W, 7, 265, 344, 487. 

FMUL (floating multiply), 208, 498. 

Forsythe, George Elmer, 4, 124. 

FORTRAN, 171-172, 265. 

Fourier, Jean Baptiste Joseph, 264. 
division method, 264. 
series, 86, 467-468. 
transform, discrete, 290-294, 300, 
482-484, 487, 494, 497, 502-503. 

Fraction overflow, 201, 239, 249. 

Fraction part of a floating point number, 
198-199, 206, 223-225, 231, 239-249. 

Fractions: Numbers in [0,1). 
conversion, 302-311. 
decimal, 181-182. 
exponentiation, 464. 
terminating, 311. 

Fractions: Rational numbers, 313, 401. 
arithmetic on, 68, 313-316, 409, 506-507. 

Fraenkel, Aviezri S, 274, 276, 585. 


Franel, Jerome, 243.

Franklin, Joel Nick, 142, 152, 153, 164, 167, 
168, 542. 

Franta, William Ray, 58. 

Free associative algebra, 418-419. 

Frequency function, see Density function. 
Frequency test, 59, 72. 

Friedland, Paul, 570. 

Frobenius, Ferdinand Georg, 625, 632. 

FSUB (floating subtract), 208. 

Fundamental theorem of arithmetic, 317, 
364, 464. 

Galambos, Janos, 611. 

Galois, Evariste, field, see Finite field. 

group of a polynomial, 625, 632. 
Gambling system, 155. 

Gamma distribution, 129-130, 135. 

Gamma function, incomplete, 54, 58, 129. 
Gap test, 60-61, 72-73, 131, 151, 167. 
Gardner, Martin, 38, 184. 

Garner, Harvey Louis, 265, 274, 276. 

Gauss, Karl (= Carl) Friedrich, 346, 398, 
404, 543, 631. 

lemma about polynomials, 404, 626. 
Gaussian integers, 544. 

Gay, John, 1. 

gcd: Greatest common divisor. 

Gebhardt, Friedrich, 33. 

Gehrhardt, Karl Immanuel, 184. 

Geiringer, Hilda, von Mises, 73. 

Gel’fond, Aleksandr Osipovich, 627. 
Generalized Riemann hypothesis, 380, 632. 
Generating functions, 135, 140, 246-247, 
262-263, 333, 506, 535, 551, 568, 623, 
624, 629, 636-637. 

Generating uniform deviates, 9-37, 

170-173. 

Geometric distribution, 131, 132, 135, 535 
(exercise 3), 549, 551. 

Geometric series, 79, 291, 501, 641. 

Gibb, Allan, 227. 

Gill, Stanley, 210. 

Gioia, Anthony Alfred, 449. 

Girard, Albert, 405. 

Givens, James Wallace, Jr., 90. 

Glaser, Anton, 185. 

Globally nonrandom behavior, 49-51, 75. 
Goertzel, Gerald, 468. 

Goldschmidt, Robert E., 296. 

Goldstine, Herman Heine, 186, 263, 310. 
Golomb, Solomon Wolf, 141, 430, 611, 652. 
Gonsalves, see Vicente Gonsalves. 

Gonzalez, Teofilo, 58. 

Good, Irving John, 60, 169. 

Gosper, Ralph William, Jr., vii, 98, 104, 
112, 190, 339, 360, 363, 518, 602. 
Gosset, William Sealy (= Student), 
distribution, 130. 
Goulard, A., 458.

Gradual underflow, 206. 

Graham, Ronald Lewis, 465, 565. 

Graph, 460-462, 466. 

Gray, Frank, code, 193, 640. 

Gray, Herbert L., 227. 

Greater than, definitely, 208, 218-219, 
227-228. 

Greatest common divisor, 316-339, 464. 
binary algorithm for, 321-323, 

330-339, 417. 

Euclidean algorithm for, see Euclid’s 
algorithm. 

multiple-precision, 327-330, 339. 
of n numbers, 323-324, 362, 364. 
of polynomials, 405-420, 434-436, 440. 
in unique factorization domain, 405. 
Greatest common right divisor, 419. 

Greek mathematics, 180-181, 318-320, 342. 
Green, Bert F., 26. 

Greenberger, Martin, 16, 84, 525. 
Greenwood, Robert Ewing, 72. 

GRH, see Generalized Riemann hypothesis.
Grosswald, Emil, 86. 

Grube, Andreas, 547. 

Grünwald, Vittorio, 188, 189.

Guilloud, Jean, 268. 

Gustavson, Fred Gehrung, 657. 

Guy, Richard Kenneth, 385, 396. 

Hadamard, Jacques Salomon, inequality, 
414, 418, 480. 

Halberstam, Heini, 614. 

Hales, Alfred Washington, 430. 

Halton, John Henry, 157. 

Halving, 277, 311, 321-322, 360, 443. 
Hamblin, Charles Leonard, 401. 

Hamlet, prince of Denmark, v. 

Hammersley, John Michael, 173. 

Hamming, Richard Wesley, 240, 248. 
Handscomb, David Christopher, 173. 
Hansen, Eldon Robert, 574. 

Hansen, Walter, 453, 455, 457, 459-460, 
464. 

Hanson, Richard Joseph, 573. 

Hardware: Computer circuitry. 

algorithms suitable for, 212 (exercise 15), 
229 (exercise 17), 265-267, 276, 
296-299, 305, 310-312, 320-321, 
441-442, 637. 

Hardy, Godfrey Harold, 366, 369, 606. 
Harmonic numbers, 661-662, 664. 

Harmonic probability, 249. 

Harmuth, Henning Friedolf, 483. 

Harriot, Thomas, 183. 

Harris, Bernard, 519. 

Harris, Vincent Crockett, 323, 339. 
Harrison, Charles, Jr., 227. 

Harrison, Michael Alexander, iv. 


Hashing, 68, 555. 

Haynes, Charles Edmund, Jr., 104. 

Hebb, Kevin Ralph, 458. 

Heilbronn, Hans Arnold, 356-357, 362.
Heindel, Lee Edward, 622. 

Hellman, Martin Edward, 388.

Henrici, Peter, 315, 507. 

Hensel, Kurt Wilhelm Sebastian, 433, 628. 

lemma, 35, 439. 

Hermite, Charles, 111. 

Herzog, Thomas Nelson, 166, 558. 
Hexadecimal digits, 179, 185, 194. 
Hexadecimal number system, 179, 184-185, 
593. 

floating point, 238-239, 248. 
nomenclature for, 185. 

Hickerson, Dean Robert, 384. 

Hindu mathematics, 181, 265. 

Hitchcock, Frank Lauren, 488. 

Hlawka, Edmund, 113.

Hoaglin, David Caster, vii. 

Hoare, Charles Antony Richard, 642. 
Homogeneous polynomial, 418, 640. 
Hopcroft, John Edward, 482, 489, 641. 
Horner, William George, 467, 470. 
rule for polynomial evaluation, 467-469, 
479, 485, 496, 499, 501. 

Horowitz, Ellis, 486. 

Howard, John Vernon, 165. 

Howe, Marion Elaine, vii. 

Howell, Thomas David, 648. 

Huff, Darrell, 39. 

Hull, Thomas Edward, 16. 

Hurwitz, Adolf, 360, 603. 

Hyde, John Porter, 401. 

IBM 360/91, 380. 

IBM System/370, 14-15. 

Idempotent, 517, 636. 

Identity, commutative ring with, 399, 401, 
407. 

Iff: If and only if. 

Ikebe, Yasuhiko, 237. 

Imaginary radix, 189, 193-194, 268. 
Improving randomness, 25, 31-34, 37. 
Inclusion and exclusion principle, 337, 536, 
567, 593, 623, 640. 

Incomplete gamma function, 54, 58, 129. 
Increment in a linear congruential sequence, 
9-10, 15, 21, 84-85, 93, 171. 
Independence, algebraic, 499. 

Independence, linear, 381, 423, 425-427, 
610. 

Independence of random numbers, 2, 40, 
43-44, 50, 53, 57, 91, 225, 532. 

Indian mathematics, 181, 192, 265. 
Induction, mathematical, 319. 
on the course of computation, 251, 254, 
265-266, 320. 
Infinite continued fraction, 341, 358.
Infinity, representation of, 209, 230, 

315, 593. 

Infinity lemma, 564. 

Inner product, 95, 481, 502 (exercise 50). 
Integrated circuit module, 297. 

Integer, random, 

among all positive integers, 143, 242, 249, 
439, 453. 

in a bounded set, 114-115, 171. 

Integer solution to equations, 326-327, 337, 
359. 

Integer-valued distribution, 131-135. 
Integration, 146-147, 154, 244. 

Interpolation, 281-282, 348, 484-486, 490, 
492, 498, 641, 657. 

Interpretive routine, 210. 

Interval arithmetic, 212, 225-227, 230, 570. 
Inverse Fourier transform, 291, 588, 641. 
Inverse function, 116, 128, 656, 
see also Reversion. 

Inverse modulo m, 25, 277, 337, 437. 

Inverse of a matrix, 95-96, 314, 482, 657. 
Irrational radix, 193. 

Irrationality, quadratic, 342, 359, 

380-382, 398. 

Irreducible polynomial, 403, 417, 421, 
437-441. 

Ishibashi, Yoshihiro, 275.

Iteration of series, 511-513, 515. 

Iur’ev, Sergei Petrovich, 350. 

Iverson, Kenneth Eugene, 210. 

JACM: Journal of the ACM, a publication 
of the Association for Computing 
Machinery, Inc. 

Jacobi, Carl Gustav Jacob, symbol, 

396-397. 

JAE (jump A even), 322, 462. 

Ja’Ja’, Joseph, 496. 

Janssens, Frank, 104, 110. 

Jansson, Birger, 518, 527. 

JAO (jump A odd), 322.

Jefferson, Thomas, 213. 

Jeremiah, 515. 

Johnk, Max Detlev, 130. 

Johnson, Samuel, 213. 

Jones, Rev. Hugh, 184, 309. 

Jones, Terence Gordon, 137. 

Jordaine, Joshua, 183. 

Judd, John Stephen, 378. 

Jurkat, Wolfgang Bernhard, 641. 

JXE (jump X even), 322. 

JXO (jump X odd), 203, 322.


k-distributed sequence, 144-149, 162, 164,
166-168. 

Kac, Mark, 369. 

Kahan, William M., vii, 206, 210, 211, 226, 
227, 228, 229, 230, 571, 574. 

Kanner, Herbert, 310. 

Karatsuba, Anatoliĭ Alekseevich, 279, 401.
Keir, Roy Alex, 592. 

Kempner, Aubrey John, 188, 363. 

Kendall, Maurice George, 2-3, 72-73. 
Kermack, William Ogilvy, 72. 

Kerr, Leslie Robert, 641. 

Kesner, Oliver, 210. 

Khinchin, Aleksandr Iakovlevich, 339, 604. 
Kinderman, Albert John, 125-126, 130. 
Klarner, David Anthony, 197. 

Klem, Laura, 26. 

Knop, Robert Edward, 131. 

Knopp, Konrad, 347. 

Knuth, Donald Ervin, ii, vi-vii,

4, 29, 85, 133, 152, 180, 189, 210, 227, 
318, 357, 362, 369, 472, 561, 564, 611, 
661, 689. 

Knuth, Jennifer Sierra, xiv. 

Knuth, John Martin, xiv. 

Kohavi, Zvi, 479. 

Kolmogorov, Andrei Nikolaevich, 54, 163, 
165, 166, 169. 

Kolmogorov-Smirnov test, 45-52, 54-58, 
59, 68. 

Konheim, Alan Gustave, 247. 

Konig, Hermann, 642. 

Koons, Florence, 310. 

Kornerup, Peter, 315-316. 

Korobov, Nikolai Mikhailovich, 110, 

152, 164. 

Kraitchik, Maurice Borisovich, 380, 391. 
Krishnamurthy, Edayathumangalam 
Venkataraman, 264. 

Kronecker, Leopold, 431, 605, 623, 

631, 663. 

Kruskal, Martin David, 520. 

KS test, see Kolmogorov-Smirnov test. 
Kuipers, Lauwerens, 110, 164. 

Kulisch, Ulrich Walter Heinz, 227. 

Kung, Hsiang Tsung, 510,

514, 657. 

Kuz’min, Rodion Osievich, 346. 

La Touche, Mrs., 178, 214. 

Laderman, Julian David, 641. 

Lafon, Jean-Claude, 641. 

Lagrange, Joseph Louis, comte, 359, 363, 
437, 508. 

identity: $(\sum a_k b_k)^2 = (\sum a_k^2)(\sum b_k^2) - \sum_{j<k} (a_k b_j - a_j b_k)^2$, 536.
interpolation polynomial, 484. 
inversion formula, 508. 

Lake, George Thomas, 310. 
Lalanne, Leon Louis Chretien, 192.

Lame, Gabriel, 343. 

Landau, Edmund Georg Hermann, 578. 
Laplace, Pierre Simon, marquis de, 346. 
Large prime numbers, 14, 276, 374-378, 
388-394, 397, 432, 480. 

Lattice, 93. 

Lattice-point model of binary gcd 
algorithm, 330-338, 344. 

Laughlin, Harry Hamilton, 264. 

Lavaux, Michel, 104. 

lcm: Least common multiple. 

Leading coefficient of a polynomial, 399, 
433, 435. 

Leading digit, 179. 

distribution of, 239-249, 387. 

Leading zeros, 206, 223-227. 

Least common left multiple, 419. 

Least common multiple, 17, 22, 276-277, 
316-317, 320, 336, 464, 595. 

Least remainder algorithm, 361. 

Least significant digit, 179. 

Lebesgue, Henri Leon, measure, 154, 
159-161, 165, 350-352, 361. 

Legendre, Adrien Marie, 309, 366, 380. 
Leger, R., 552. 

Lehman, Russell Sherman, 371, 388. 
Lehmer, Derrick Henry, vi, 9-10, 45, 52, 
142, 264, 328-329, 367, 374, 375, 378, 
380, 391, 395, 397, 465, 607, 629. 
Lehmer, Derrick Norman, 263, 612. 
Lehmer, Emma Markovna Trotskaia, 374. 
Leibniz (= Leibnitz), Gottfried Wilhelm, 
freiherr von, 184. 

Lempel, Abraham, 530. 

Leonardo Pisano, see Fibonacci. 

Leong, Benton Lau, 466. 

Leslie, Sir John, 192. 

Less than, definitely, 208, 218-219, 
227-228. 

Levene, Howard, 72. 

LeVeque, William Judson, 359, 516. 

Levin, Leonid Anatol’evich, 164. 

Levy, Paul, 346. 

Lewis, John Gregg, 572. 

Lewis, Peter Adrian Walter, 642. 

Lewis, Theodore Gyle, 30. 
li: Logarithmic integral function. 

Liang, Franklin Mark, vii. 

Linear congruential sequence, 9-11. 
choice of increment, 10, 15, 21, 84-85, 
93, 171. 

choice of modulus, 11-15, 170. 
choice of multiplier, 10, 15-25, 84-86, 
98-105, 170-171. 
choice of seed, 15, 19, 137, 170. 
period length, 15-22. 
subsequence of, 10, 71. 


Linear equations, 276. 

integer solution to, 326-327. 

Linear iterative array, 297-300, 311. 

Linear lists, 265, 266, 268. 

Linear operators, 347-350, 361. 

Linear recurrence, 26-29, 34-37, 332-333, 
392-395, 568, 637. 

Linearly independent vectors, 381, 

425-427, 610. 

Linked memory, 265, 266, 268, 295, 400. 
Linking automaton, 295, 301. 

Linnainmaa, Seppo, 227, 229. 

Liouville, Joseph, 363. 

Lipton, Richard Jay, 478, 638. 

Liquid measure, 183. 

Littlewood, John Edensor, 366. 

Local arithmetic, 184. 

Locally nonrandom behavior, 43, 49-51, 

145, 162. 

Logarithm, 297. 
of power series, 514. 
of uniform deviate, 128. 

Logarithmic integral, 614. 

Logarithmic law of leading digits, 

239-249, 387. 

Logical operations, 29-31, 177, 186, 305, 
311-312, 322, 373-374, 434, 439, 617, 
629, 637. 

Long division, 255-260, 263-268. 

Loos, Rudiger Georg Konrad, 619. 

Lotti, Grazia, 482. 

Lovelace, Ada Augusta, countess of, 173. 
Loveland, Donald William, 165, 166, 169. 
Lower bounds, see Complexity of 
calculation. 

Lubkin, Samuel, 310. 

Lucas, Francois Edouard Anatole, 375, 391, 
395, 397. 

Luther, Herbert Adesla, 263. 

Maas, Robert Elton, 190. 

MacLaren, Malcolm Donald, vi, 31, 44, 123, 
525, 549. 

MacMahon, Maj. Percy Alexander, 566. 
MacMillan, Donald B., 210. 

Macnaghten, Antony Martin, 642. 
MacPherson, Robert D., 110. 

MacSorley, Olin Lowe, 265. 

Mahler, Kurt, 167. 

Mallows, Colin Lingwood, 72. 

Mandelbrot, Benoit B, 564. 

MANIAC III, 227.

Manipulation of power series, 506-515. 
Mantel, Willem, 526. 

Mantissa, 199, see Fraction part. 

Mariage, Aime, 185.

Mark II Calculator, 209. 

Marsaglia, George, 22, 31, 44, 104, 110, 

114, 117, 118, 123, 128, 129, 130, 521, 
525, 552. 
Martin, Monroe Harnish, 31, 35.
Martin-Lof, Per, 163, 166. 

Math. Comp.: Mathematics of 

Computation, a journal published by 
the American Mathematical Society. 
Matrix: Rectangular array, 
characteristic polynomial, 480. 
determinant, 338, 358, 415, 416, 479-480, 
482, 496. 

greatest common right divisor, 419. 
inverse, 95-96, 314, 482, 657. 
multiplication, 481-482, 487-488, 
502-505, 641. 
null space, 425-427, 625. 
permanent, 480, 497. 
rank, 425-427, 488, 496, 502, 625. 
semidefinite, 551. 
singular, 112, 494-495, 501. 
triangularization, 425-427, 621, 625. 
Matrix (Bush), Dr. Irving Joshua, 38. 
Matthew, Saint, 668. 

Matula, David William, 194, 195, 312, 
315-316, 363. 

Maximum-of-t test, 49, 51, 57, 68, 74, 117, 
151, 167. 

Maya Indians, 180. 

McClellan, Michael Terence, 276. 
McCracken, Daniel Delbert, 210. 
McKendrick, A. G., 72. 

Mean, evaluation of, 216, 229. 

Measure, units of, 182-185, 193, 310. 
Measure theory, 154, 159-161, 165, 

350-352, 361. 

Mediant rounding, 314-315, 363-364. 
Mendelsohn, Nathan Saul, 195. 

Mendes France, Michel, 602. 

Mental arithmetic, 279. 

Mersenne, Marin, 375, 389, 391. 

primes, 13, 391-395, 397. 

Mertens, Franz Carl Joseph, 595. 
METAFONT, 689.

Metrology, 183-185. 

Metropolis, Nicholas Constantine, 4, 225, 
227, 310. 

Metze, Gernot, 265. 

Meyer, Albert Ronald da Silva, 301. 
Middle-square method, 3-5, 7-8, 26, 518. 
Mignotte, Maurice, 627. 

Mikusinski, Jan, 363. 

Miller, Gary Lee, 379, 380. 

Miller, Jeffrey Charles Percy, 507, 637. 
Miller, Webb Colby, 466. 

Milne-Thompson, Louis Melville, 487. 
Minimizing a quadratic form, 94-98, 105, 
111-112.

Ministep, 452. 

Minkowski, Hermann, 544. 

Minus zero, 186-187, 230, 234, 253, 590. 
Miranker, Willard Lee, 227. 


Mitchell, Gerard Joseph Francis Xavier, 

26, 30. 

MIX computer, vi, 186-187, 193, 350, 

395, 612. 

binary version, 186, 322-323, 373-374. 
floating point attachment, 199, 208-209, 
211-212, 498. 

Mixed congruential method, 10, see Linear 
congruential sequence. 

Mixed-radix number systems, 64, 183, 
192-196, 274-275, 277, 486. 
addition, 193, 266, 589. 
balanced, 100, 586. 
comparison, 274-275. 
counting in, 99-100. 
multiplication and division, 193, 589. 
radix conversion, 309-310. 

Mixture of distribution functions, 118-119, 
133-134. 

Mobius, August Ferdinand, function, 337, 
361, 437, 440. 

inversion formula, 437, 604. 
mod, 212, 305, 402, 521, 586, 666. 
mod m arithmetic, 

addition, 11, 14-15, 171, 187, 271-272. 
division, 25 (exercise 7), 277, 337, 
427-428, 480. 
halving, 277. 

multiplication, 11-15, 272, 614. 
on polynomial coefficients, 400-402. 
square root, 389, 437, 615. 
subtraction, 15, 171, 271-272. 

Model V, 209. 

Modular arithmetic, 268-278, 287-290, 
434-436, 440, 480. 

Modulus in a linear congruential sequence, 
9-15, 170. 

Moenck, Robert Thomas, 429, 486. 

Møller, Ole, 227.

Monahan, John Francis, 125-126, 130. 
Monic polynomial, 399, 401, 402, 405, 

436, 500. 

Monier, Louis Marcel Gino, 396, 613. 
Monomial, evaluation of, 465-466. 

Monte Carlo, 2, 53, 110, 173. 
method for factoring, 369-371, 377, 

394, 396. 

Moore, Donald Philip, 26, 30. 

Moore, Ramon Edgar, 227. 

Morris, Robert, 570. 

Morrison, Michael Allan, 380, 384. 

Morse, Harrison Reed, III, 176.

Morse, Samuel Finley Breese, code, 361.
Moses, Joel, 435-436. 

Moses, Lincoln Ellsworth, 140. 

Most significant digit, 179. 

Motzkin, Theodor Samuel, 363, 471, 475, 
476, 478, 500, 501. 

Muller, Mervin Edgar, vi, 117, 137. 
Multiple, 403.

Multiple-precision arithmetic, 186, 250-301, 
309, 327-330, 339, 400. 
addition, 250-252, 262-263, 265-268. 
comparison, 266. 

division, 255-261, 263-268, 295-297. 
greatest common divisor, 327-330, 339. 
multiplication, 253-255, 266-267, 
278-301, 443. 
radix conversion, 309, 311. 
subtraction, 250-253, 265-268. 
table of constants, 659-660. 
Multiplication, 178, 189, 191, 197, 250-251, 
253-255, 266-267, 278-301. 
complex, 189, 468, 487, 501. 
double-precision, 234-237, 278-279. 
fast (asymptotically), 278-301. 
floating point, 204, 207, 208, 215-216, 
224, 226-230, 234-235, 237, 248-249. 
fractions, 266, 313, 315. 
matrix, 481-482, 487-488, 502-505, 641. 
mixed-radix, 193, 589. 
mod m, 11-15, 272, 614. 
mod u(x), 428. 
modular, 269-272. 

multiple-precision, 253-255, 266-267, 
278-301, 443. 

polynomial, 399-400, 489-494, 503, 652. 
power series, 506. 

Multiplicative congruential method, 10, 
18-21, 105. 

Multiplier in a linear congruential sequence, 
9-10, 15-25, 84-86, 98-105, 170-171. 
Multiset, 454, 464, 636. 

Multivariate polynomial, 399-400, 403, 
418-419, 436, 438-439, 479-505. 

Munro, James Ian, 496, 647. 

Musinski, Jean Elisabeth Abramson, 489. 
Musical notation, 182. 

Musser, David Rea, 264, 434, 436. 

Nadler, Morton, 268. 

Nance, Richard Earle, 173. 

Nandi, Salil Kumar, 264. 

Napier, John, baron of Marchiston, 

178, 184. 

Needham, Joseph, 271. 

Negabinary number system, 188-189, 
193-194, 196, 311. 

Negacyclic convolution, 503. 

Negadecimal number system, 188, 194. 
Negative binomial distribution, 135. 
Negative digits, 190-197, 638. 

Negative numbers, representation of, 
186-197. 

Negative radix, 188-189, 193-194, 196, 311. 
Neighborhood of a floating point number, 
218. 

Neugebauer, Otto Eduard, 180, 209. 


Newcomb, Simon, 239. 

Newman, Donald Joseph, 638. 

Newton, Sir Isaac, 431, 467, 640. 
interpolation formula, 485-486, 498. 
method for rootfinding, 264, 295, 

510, 656. 

Nickel, Laura Ann, 391. 

Niederreiter, Harald Gunther, 104, 105, 

109, 110, 113, 164, 548. 

Nijenhuis, Albert, 140. 

Nines, casting out, 273, 287, 307. 

Nines’ complement notation, 187, 194. 

Niven, Ivan Morton, 149. 

Noll, Curt Landon, 391. 

Nonary (radix 9) number system, 183, 591. 

Noncommutative multiplication, 418-419,
481, 487-496, 501-505. 

Nonnegative: Zero or positive. 

Normal deviates: Random numbers with 
the normal distribution, 117-127. 
dependent, 127, 134. 

Normal distribution, 54, 71, 117-127, 129, 
130, 134-135, 368. 

Normal evaluation scheme, 487, 650-651. 

Normal number, 164. 

Normalization of floating point numbers, 
199-208, 211-212, 223, 227, 233, 

239, 573. 

Norton, Karl Kenneth, 367. 

NP complete problem, 550, 639. 

Null space of a matrix, 425-427, 625. 

Number sentences, 562. 

Number system: A language for 
representing numbers, 
balanced decimal, 195. 
balanced mixed-radix, 100, 586. 
balanced ternary, 190-193, 211, 268, 336. 
binary (radix 2), 179, 182-186, 400,

441, 464. 
binomial, 193. 

complex, 189-190, 193-194, 196, 

268, 401. 

decimal (= denary, radix ten), 181-183, 
194-195, 359. 

duodecimal (radix twelve), 183. 
factorial, 64, 192. 

Fibonacci, 193. 

floating point, 198-199, 206, 223, 

225, 231. 

hexadecimal (radix sixteen), 179, 

184-185, 593. 

mixed-radix, 64, 100, 183, 192-196, 
274-275, 277, 309-310, 486, 586. 
modular, 268-271. 
negabinary (radix -2), 188-189,

193-194, 196, 311. 
negadecimal, 188, 194. 
nonary (radix 9), 183, 591. 
octal (= octonary = octonal, radix 8),
178, 183-186, 188, 194, 306-308, 462. 
p-adic, 197, 562, 587, 628. 
phi, 193. 

positional, 144-145, 159-160, 164, 
179-197, 302-312. 
primitive tribal, 179, 182. 
quater-imaginary (radix 2i), 189,

193-194, 268. 

quaternary (radix 4), 179, 183. 
quinary (radix 5), 179, 183, 197. 
rational, 313-316, 401. 
regular continued fraction, 330, 341-342, 
352-353, 358-363. 
reversing binary, 196. 
revolving binary, 196. 
sedecimal (= hexadecimal), 179, 

184-185, 593. 
senary (radix 6), 183. 
senidenary (= hexadecimal), 179, 
184-185, 593. 
septenary (radix 7), 183. 
sexagesimal (radix sixty), 180-183, 

209, 309. 

slash, 314-315, 363-364. 
ternary (radix 3), 179, 183, 190-193, 197, 
211, 268, 311, 336. 
vigesimal (radix twenty), 180. 

Nussbaumer, Henri Jean, 503, 651.

Nystrom, John William, 184-185. 

O-notation: O(f(n)) denotes a quantity
whose magnitude is less than some
constant times f(n), for all large n.

Oakford, Robert Vernon, 140. 

Octal (radix 8) number system, 178, 
183-186, 188, 194, 306-308, 462. 

Odd-even method, 124-125, 134. 

Odd polynomial, 496. 

Odlyzko, Andrew Michael, 565. 

OFLO, 202.

Olivos Aravena, Jorge Augusto Octavio, 
466. 

On-line algorithm, 506-510, 657. 

Ones’ complement notation, 187, 194, 261, 
262, 264, 272, 391. 

Operands: Quantities that are operated on; 
e.g., u and v in the calculation of 
u + v. 

Optimum methods of computation, 
see Complexity. 

Order of a modulo m, 19-22, 375-376. 

Order of an element in a field, 438. 

Order-of-magnitude zero, 224. 

Oriented tree, 444-446, 463, 530, 567. 

Ostrowski, Alexander Markus, 475. 

Oughtred, William, 209, 309. 


Overflow, 11-12, 202, 226, 237, 252-253, 
277, 315, 522. 

exponent, 201, 203, 206, 211, 216, 
227-228, 233. 
fraction, 201, 239, 249. 
rounding, 201, 203, 204, 207, 208, 
211-212.

Overstreet, Claude Lee, Jr., 173. 

Owen, John, 1. 

Owings, James Claggett, Jr., 166. 

p-adic numbers, 197, 562, 587, 628. 

Pade, Henri Eugene, 515. 

Palindrome, 398, 617 (exercise 2). 

Palmer, John Franklin, 206. 

Pan, Viktor Iakovlevich, 471, 473, 478, 482, 
488, 498, 501, 503, 505, 641, 644, 647, 
654, 655. 

Papadimitriou, Christos Harilaos, 639. 
Pappus of Alexandria, 209. 

Parallel computation, 270, 301, 469, 484. 
Parameter step, 475, 500. 

Pardo, see Trabb Pardo. 

Parlett, Beresford, 178. 

Parry, William, 193. 

Partial fraction expansion, 81, 492, 628. 
Partial ordering, 636. 

Partial quotients, 83, 342. 

distribution of, 345-353, 615-616. 
Partition test, 62, 72, 151. 

Pascal, Blaise, 183. 

Paterson, Michael Stewart, 301, 501. 
Patience, 174. 

Paul, Nicholas John, 123. 

Pawlak, Zdzislaw, 188, 268. 

Payafar, Mahmoud, 627. 

Payne, William Harris, 30. 

Paz, Azaria, 479. 

Peano, Giuseppe, 185. 

Pearson, Karl, 52-54. 

Pease, Marshall Carleton, III, 642.

Peirce, Charles Santiago Sanders, 363, 

516, 607. 

Penk, Michael Alexander, 599. 

Penney, Walter Francis, 189. 

Percentage point, 41, 43, 48, 69-70, 368. 
Perfect numbers, 389. 

Perfect square, 372. 

Period in a sequence, 7-9. 

length of, 4, 7-8, 15-22, 34-36, 392-393. 
Periodic continued fraction, 359, 398. 
Permanent, 480, 497. 

Permutation: Ordered arrangement of a 
multiset. 

encoding, 64, 75, 139. 
random, 139-141, 369, 441, 632. 
Permutation test, 64, 75, 76, 147. 

Perron, Oskar, 339. 
Persian mathematics, 181-182, 265,

309, 443. 

Pervushin, Rev. Ivan Mikheevich, 391.
Peters, Johann (= Jean) Theodor, 661. 
Pfeiffer, John Edward, 176. 

Phalen, Harold Romaine, 184. 

Phi (φ), 342, 343, 496, 659-660.

number system, 193. 

Phillips, Ernest William, 185. 

Pi (π), 38, 144-145, 152, 154, 182, 184, 193,
268, 342, 659-660. 

Pingala Acharya, 441. 

Pippenger, Nicholas John, 639. 

Places, 250. 

Planck, Max Karl Ernst Ludwig, constant, 
198, 211, 223, 225-226. 

Plass, Michael Frederick, 614. 

Playwriting, 174-176. 

Pocklington, Henry Cabourn, 397. 

Pointer machine, 295, 301. 

Poisson, Simeon Denis, distribution, 53, 
132-133, 135-136, 517. 

Poker test, 62, 72, 151. 

Polar coordinates, 54, 57, 118. 

Polar method, 117-118, 120, 130-131. 
Pollard, John Michael, 369, 385, 396, 

608, 652. 

Polynomial, 399-401. 
addition, 399-401. 
arithmetic modulo m, 34-35, 

400-401, 444. 
characteristic, 480. 
content of, 405-406. 
degree of, 399, 401, 410. 
derivative of, 421, 470, 631. 
discriminant of, 619, 628, 632. 
division, 401-420, 468-469, 515. 
evaluation, 466-505, 588 (exercise 8). 
factorization, 420-441. 
over a field, 401-403, 405, 417, 420-431, 
436-441. 

greatest common divisor, 405-420, 

434-436, 440.

interpolation, 281-282, 484-486, 490, 492, 
498, 641. 

irreducible, 403, 417, 421, 437-441. 
leading coefficient of, 399, 433, 435. 
monic, 399, 401, 402, 405, 436, 500. 
multiplication, 399-400, 489-494, 

503, 652. 

multivariate, 399-400, 403, 418-419, 436, 
438-439, 479-505. 
primitive, 404, 417. 
primitive modulo p, 28-29, 404. 
primitive part, 404-406. 
random, 417, 430, 436-437, 439, 441. 
remainder sequence, 408-418, 420, 

435-436, 657.
resultant, 415, 619. 


roots, 22, 416, 418, 420, 464, 

474-475, 499. 
squarefree, 421, 436, 440. 
string, 418-419. 
subtraction, 399-401. 
over a unique factorization domain, 
403-420, 431-441. 

Polynomial chains, 475-479, 499-501. 

Pope, David Alexander, 263. 

Popper, Karl Raimund, 166. 

Portable random number generator, 
171-173. 

Porter, J. W., 356-357. 

Positional representation of numbers, 
144-145, 159-160, 164, 179-197, 
302-312. 

Positive definite quadratic form, 94, 111. 

Positive semidefinite matrix, 551. 

Potency, 22-25, 50, 71, 76, 78, 88. 

Power, raising to a, see Exponentiation, 
factorial, 281-282, 497, 597, 664. 

Power series: A sum of the form
$\sum_{k \ge 0} a_k z^k$, see Generating function,
manipulation of, 506-515. 

Power tree, 444-445, 462-463. 

Powers, Don M., 296. 

Powers, evaluation of, 441-466. 

Powers, Raymond E., 380, 391. 

pp: Primitive part, 404-406. 

Pr, 143, 145, 162, 166-169, 242, 249, 453. 

Pratt, Vaughan Ronald, 339, 395, 441. 

Precision: The number of digits in a 
representation. 

double, 230-237, 263-264, 278-279. 
quadruple, 237. 

single: fitting in one computer word, 199. 
unlimited, 265, 268, 314. 

Preconditioning, see Adaptation. 

Primality tests, 364, 374-380, 391-398. 

Prime element in a unique factorization 
domain, 403. 

Prime number: Integer greater than unity 
having no proper divisors, 316-317, 
353, 364, 615. 

distribution, 366-369, 395-396, 615, 
632-633. 

factorization into, 12-13, 317, 353, 
364-398, 464. 

large, 14, 276, 374-378, 388-394, 397, 
432, 480. 

Mersenne, 391-395, 397. 
theorem, 366, 615. 
useful, 276, 390, 391, 614, 652. 
verifying primality, 364, 374-380, 
391-398. 

Primitive element modulo m, 19-22. 

Primitive notations for numbers, 179, 182. 

Primitive part of a polynomial, 404-406. 

Primitive polynomial, 404, 417. 
Primitive polynomial modulo p, 28-29, 404.
Primitive recursive function, 159. 

Primitive root: A primitive element 
modulo p or in a finite field, 19-22, 

437, 438. 

Probabilistic algorithms, 2, 379-380, 

385-386, 396-397, 428-430, 439, 630. 
Probability: Ratio of occurrence, 142, 165. 
over the integers, 143, 145, 162, 166-169, 
242, 249, 453. 

Probert, Robert Lorne, 641.

Programming languages, 206. 

Pronouncing hexadecimal numbers, 185. 
Proof of algorithms, 265, 266, 319-320. 
Proofs, constructive versus nonconstructive, 
270, 273-274, 585. 

Proper factor, see Divisor. 

Proth, F. (or E.), 614. 

Pseudo-division of polynomials, 407, 416. 
Pseudo-random sequence, 3. 

Ptolemæus (= Ptolemy), Claudius, 181.
Public key cryptography, 388-389. 

Purdom, Paul Walton, Jr., 519. 

Quadratic congruential sequence, 25-26, 34. 
Quadratic forms, 94, 385, 503. 
minimizing, over the integers, 94-98, 105, 
111-112.

Quadratic irrationality, 342, 359, 

380-382, 398. 

Quadratic reciprocity law, 377, 394, 

396, 614. 

Quadratic residues, 397, 638. 
Quadruple-precision arithmetic, 237. 
Quandalle, Philippe, 651. 

Quasi-random numbers, 3, 173. 
Quater-imaginary number system, 189, 
193-194, 268. 

Quaternary number system, 179, 183. 

Quick, Jonathan Horatio, 74. 

Quinary number system, 179, 183, 197. 
Quotient: ⌊u/v⌋, 250-251, see Division,
of polynomials, 402-404, 407, 416. 
partial, 83, 342, 345-353, 615-616. 
trial, 255-260, 263-264, 266-267. 

Rabin, Michael Oser, 380, 389, 396, 

397, 430. 

Rabinowitz, Philip, 264. 

Rademacher, Hans, 86. 

Radioactive decay, 6, 128, 132. 

Radix: Base of positional notation, 179.
complex, 189-190, 193-194, 268. 
irrational, 193. 

mixed, 64, 99-100, 183, 192-196, 266, 
274-275, 309-310, 486. 
negative, 188-189, 193-194, 196, 311. 


Radix conversion, 184, 188, 189, 191, 
193-194, 250, 302-312, 467, 470. 
floating point, 309-312. 
multiple-precision, 309, 311. 

Radix point, 9, 179, 182, 187-188, 192-193, 
198, 302. 

Raimi, Ralph Alexis, 242, 247. 

Raleigh, Sir Walter, 183. 

Rall, Louis Baker, 225.

Ramage, John Gerow, 130. 

Ramanujan, Srinivasa Aiyangar, 613. 

Ramaswami, Vammi, 367. 

Ramshaw, Lyle Harold, 157. 

RAND Corporation, 2-3. 

Randell, Brian, 186, 209. 

Random bit, 11, 29-31, 35-36, 45, 

114-115, 133. 

Random combination, 136-141. 

Random direction, 130-131. 

Random function, 4-8, 369. 

Random integer, 

in a bounded set, 114-115, 171. 
among all positive integers, 143, 242, 249, 
439, 453. 

Random mapping, 4-8, 369. 

Random numbers, 1-177.
generating nonuniform deviates, 114-136.
generating uniform deviates, 9-37,
170-173.
machines for generating, 2-3, 387.
quasi-, 3, 173.
summary, 170-173.
tables, 2-3, 152.
testing, 38-113, see Testing.
using, 1-2, 114-141, 615,
see also Probabilistic algorithms.

Random permutation, 139-141, 369,
441, 632.

Random point, in a circle, 117-118. 
in a sphere, 131. 
on an ellipsoid, 136. 
on a sphere, 130-131. 

Random polynomial, 417, 430, 436-437,
439, 441.

Random random-number generator,
4-8, 25.

Random sample, 136-141. 

Random sequence, meaning of, 2, 142-169. 
finite, 145, 161-164. 

Random waiting time, 114. 

Randomness, definitions of, 142-169. 
improving, 25, 31-34, 37. 
testing for, see Testing. 

RANDU, 25, 104, 173, 525. 

Range arithmetic, 212, 225-227, 230, 570. 

Rank, of apparition, 393. 
of a matrix, 425-427, 488, 496, 502, 625. 
of a tensor, 488-489, 494-496, 501-505. 

Rapoport, Anatol, 519. 
Ratio method, 125-128, 135.

Rational arithmetic, 68, 313-316, 409, 
506-507. 

Rational functions, 401, 479, 500. 
approximation and interpolation,
420, 515.

Rational number, 313, 401, 439. 
approximation, 314-316, 363-364. 
positional representation, 195, 311, 359. 

Real numbers, 401. 

Real time, 270. 

Realization of a tensor, 489. 

Reciprocal, 264, 295-297, 403. 
floating point, 228. 
mod m, 25, 427, 437, 595, 599. 

Reciprocal differences, 487. 

Reciprocity laws, 79, 86, 377, 394, 396, 614. 

Recorde, Robert, xi, 265. 

Rectangle-wedge-tail method, 118-123, 134. 

Rectangular distribution, see Uniform 
distribution. 

Recurrence relations, 9, 25-29, 33-37,
246-247, 279, 280, 285-286, 288,
295, 296-297, 300, 332-333, 392-395,
481-482, 506, 507, 519, 568, 597, 630,
637, 654, 655.

Recursive method, 237, 279, 283-285, 287, 
295, 400, 481-482, 632, 652-653, 658. 

Reeds, James Alexander, III, 561. 

Rees, David, 36, 163. 

Regular continued fraction, 330, 341-342, 
352-353, 358-363. 

Reiser, John Fredrick, vii, 28, 36, 227. 

Reitwiesner, George Walter, 265. 

Rejection method, 120-123, 129,
134-135, 553.

Relative error, 206, 213, 216-217, 237, 240. 

Relatively prime: Having no common prime 
factors, 11, 313, 324, 404, 417. 

Remainder: Dividend minus quotient times 
divisor, 250-251, 256, 402, 407, 416, 
515, see also mod. 

Replicative law, 86. 

Representation of numbers, see Number 
system. 

Representation of trees, 463, 530, 555, 634. 

Representation of ∞, 209, 230, 315, 593.

Reservoir sampling, 138-140. 

Residue arithmetic, 269, see Modular 
arithmetic. 

Result set, 475-476, 499. 

Resultant of polynomials, 415, 619. 

Revah, Ludmila, 647. 

Reverse of a polynomial, 416, 434, 436,
618, 657.

Reversing binary number system, 196. 

Reversion of power series, 508-511, 514. 

Revolving binary number system, 196. 

Rezucha, Ivan, 137. 


Rhind papyrus, 443. 

Rho method, see Monte Carlo method for 
factoring. 

Rieger, Georg Johann, 605. 

Riemann, Georg Friedrich Bernhard, 78,
366, 396.
hypothesis, 366-367, 380, 632.
integration, 146-147, 244.

Ring with identity, commutative, 399, 401, 
407. 

Riordan, John, 520. 

Rivest, Ronald Linn, 386, 648. 

Robber, 174-176. 

Robinson, Donald Wilford, 528. 

Robinson, Julia Bowman, 616. 

Robinson, Raphael Mitchel, 614, 652. 
Roman numerals, 179, 193, 679. 

Romani, Francesco, 482. 

Roof, Raymond Bradley, 110. 

Roots of a polynomial, 22, 416, 418, 420, 
464, 474-475, 499. 

Roots of unity, see Cyclotomic
polynomials, Exponential sums.

Ross, Douglas Taylor, 176. 

Rotenberg, Aubey, 10, 45. 

Roulette, 9, 53, 240. 

Rounding, 201, 203, 206, 207, 212, 215-216, 
219, 221-222, 226, 314-316, 363-364. 
Rounding overflow, 201, 203, 204, 207, 208,
211-212.

Rozier, Charles P., 308. 

RSA box, 386-389, 397. 

Rudolff, Christof, 182. 

Rumely, Robert Scott, 380. 

Run test, 61, 65-68, 72, 74, 88, 151, 167. 
Runge, Carl David Tolme, 642. 

Runs above (or below) the mean, 61. 
Russian peasant method, 443. 

Ruzsa, Imre Zoltan, 197. 

Ryser, Herbert John, 497, 641. 

Sachau, Karl Eduard, 441. 

Sahni, Sartaj, 58. 

Saidan, Ahmad Salim, 182, 441. 

Salamin, Eugene, 268. 

Samelson, Klaus, 226-227, 310. 

Samet, Paul Alexander, 304. 

Sampling (without replacement),
1, 136-141.
weighted, 141.

Sands, Arthur David, 567. 

Savage, John Edmund, 648. 

Sawtooth function, 77, 86. 

Saxe, James Benjamin, 136. 

Scarborough, James Blaine, 226. 

Schmid, Larry Philip, 71. 

Schmidt, Erhard, orthogonalization process, 
97, 620. 

Schmidt, Wolfgang M., 169. 
Schnorr, Claus-Peter, 166, 397, 478.

Scholz, Arnold, 459. 

Schönhage, Arnold, 276, 287-288, 290, 295,
300, 301, 311, 451, 465, 482, 592, 598, 
638, 655. 

Schreyer, Helmut, 186. 

Schröder, Ernst, 512.
function, 512-513. 

Schroeppel, Richard Crabtree, 383, 384. 
Schwartz, Jacob Theodore, 619. 

Schwarz, Stefan, 430. 

Secrest, Don, 265, 310. 

Secret keys, 177, 386-389, 397, 486. 

Secure communications, 386-389, 397. 
Sedecimal number system, 186, 
see Hexadecimal. 

Sedgewick, Robert, 518. 

Seed (starting value) in a linear
congruential sequence, 9, 15, 19, 137,
170.

Seidenberg, Abraham, 182. 

Selection sampling, 137-138, 140. 

Selfridge, John Lewis, 378, 395. 

Semigroup, 517. 

Senidenary number system, 186, 
see Hexadecimal. 

Septenary (radix 7) number system, 183. 
Serial correlation test, 70-72, 75, 78-85,
148, 168.

Serial test, 36, 60, 72, 73, 85-88, 91, 
105-110, 113, 151. 

Sethi, Ravi, 466. 

SETUN, 192. 

Sexagesimal number system, 180-183,
209, 309.

Shakespeare, William, v. 

Shallit, Jeffrey Outlaw, 363. 

Shamir, Adi, 386, 398, 486. 

Shanks, Daniel Charles, 268, 360, 384, 385, 
626. 

Shannon, Claude Elwood, 195. 

Shaw, Mary Margaret, 470, 479, 497. 

Sheriff, 174-176. 

Shift operators of MIX, 322. 

Shift register recurrence, 29-31, 35, 424. 
Shirley, John William, 183. 

Shuffling a random sequence, 31-35, 37. 
Shuffling cards, 139-141. 

Sibuya, Masaaki, 128.

Sieve procedure, 373-374, 394. 

Sieveking, Malte, 656. 

Signatures, Digital, 388-389. 
Signed-magnitude representation, 186-187, 
193-194, 198, 232, 250. 

Significant digits, 179, 213, 223. 

Sikdar, Kripasindhu, 310. 

Silver, Roland Lazarus, 654. 

Simulation, 1. 

Sine, 471. 


Singh, Avadhesh Narayan, 441. 

Singleton, Richard Collom, 642. 

Sink vertex, 461. 

Slash arithmetic, 314-316, 363-364. 

SLB (shift left AX binary), 322. 

Slide rule, 209, 240. 

Slowinski, David Allen, 391. 

Small step, 447. 

Smirnov, Nikolai Vasil’evich, 54, 55. 

Smith, David Eugene, 181, 182. 

Smith, Henry John Stephen, 598. 

Smith, J. E. Keith, 26. 

Smith, Robert Leroy, 212. 

Sobol’, Il’ia Meerovich, 519. 

Soden, Walter, 306. 

Solitaire, 174. 

Solovay, Robert Martin, 380, 396. 

Sorted uniform deviates, 130, 132, 136. 
Source vertex, 461. 

Sowey, Eric Richard, 173. 

Species of measure zero, 166. 

Spectral test, 24, 29, 89-113, 170, 530.
algorithm for, 98-101.

Sphere, n-dimensional, 54. 
random point on, 130-131. 
volume of, 101. 

Spherical coordinates, 57. 

SQRT box, 389, 397. 

Square root, 117, 197, 268, 342, 359, 
380-382, 389, 398, 464. 
modulo p, 437. 
power series, 507. 

Squarefree polynomials, 421, 436, 440. 
Squeeze method, 120-123, 129,
134-135, 553.

SRB (shift right AX binary), 322, 462. 
Stability of polynomial evaluation, 467,
470, 471.

Stack: Linear list with last-in-first-out 
growth pattern, 283-285. 

Standard deviation, evaluation of, 216, 229. 
Stanley, Richard Peter, 558. 

Star chain, 447, 453-457, 461, 463. 

Star step, 447. 

Stark, Richard Harlan, 210. 

Starting value in a linear congruential 
sequence, 9, 15, 19, 137, 170. 

Statistical tests, see Testing. 

Steele Jr., Guy Lewis, 589. 

Stegun, Irene Anne, 41, 661. 

Stein, Josef, 321. 

Stein, Marvin Leonard, 263. 

Stern, Moriz Abraham, 607. 

Stern-Peirce tree, 363, 608. 

Stevin, Simon, 182, 405. 

Stibitz, George Robert, 186, 209. 

Stirling, James,
approximation, 57, 517, 605, 636.
numbers, 62, 63, 282, 519, 624, 665.
Stockmeyer, Larry Joseph, 301.

Stolarsky, Kenneth Barry, vi. 

Stone, Harold Stuart, 210. 

Stoppard, Tom, 60. 

Storage modification machines, 295, 301. 
Strassen, Volker, 290, 295, 300, 380, 396, 
478, 481, 488, 503, 648. 

Straus, Ernst Gabor, 363, 465. 

String polynomials, 418-419. 

Stroud, Arthur Howard, 265, 310. 

Sturm, Jacob Karl Franz, 416, 420, 619. 
Subbarao, Mathukumalli Venkata, 449. 
Subresultant algorithm, 410-418, 420, 
436-436, 657. 

Subsequence rule, 155-160, 162-163, 
165-166, 169. 

Subsequence tests, 65, 151.

Subtraction, 178, 191, 197, 250-253, 
265-268. 
complex, 468. 
continued fractions, 602. 
floating point, 200-204, 214-216,
219-225, 230, 232-234, 238-239, 249.
fractions, 313-315. 
mixed-radix, 193. 
mod m, 15, 171, 271-272. 
modular, 269, 277. 

multiple-precision, 250-253, 265-268. 
polynomial, 399-401. 
power series, 506. 

Subtractive random number generation, 36, 
171-173. 

Sugunamma, Mantri, 449. 

Sum of periodic sequences, 31, 35. 
Summation by parts, 597. 

Sun Tsŭ (= Wu), 265, 271.
Suokonautio, Vilho, 265. 
sup, 532. 

Svoboda, Antonin, 267, 276. 

Swedenborg, Emanuel, 184. 

Sweeney, Dura W., 238, 360. 
Swinnerton-Dyer, Henry Peter Francis, 625. 
Sykora, Ondrej, 641. 

Sylvester, James Joseph, 415, 417. 

Synthetic division, 402. 

System/370, 14-15, 104. 

Szabo, Nicholas Sigismund, 275, 276. 
Szymanski, Thomas Gregory, 518. 

t-distribution, 130. 

Tables of fundamental constants, 342, 584 
(exercise 36), 614, 659-662. 

Tague, Berkley Arnold, 401. 

Tail of a floating point number, 220. 

Tail of the binomial distribution, 160. 

Tail of the normal distribution,
122-123, 134.

Takahasi, Hidetosi, 275.

Tanaka, Richard Isamu, 276.


Tangent, 360, 647. 

Tannery, Jules, 226. 

Taranto, Donald Howard, 310. 

Tarski, Alfred, 502. 

Taussky Todd, Olga, 104. 

Tausworthe, Robert Clem, 30. 

Taylor, Alfred Bower, 185. 

Taylor, Brook, 470. 

Television script, 174-176. 

Ten’s complement notation, 186-187, 194. 
Tensor, 487-496, 501-505. 

Terminating fractions, 311. 

Ternary number system, 179, 183, 190-193, 
197, 211, 268, 311, 336. 
balanced, 190-193, 211, 268, 336. 

Testing for randomness, 38-113. 
a priori tests, 75. 

chi-square test, 39-45, 50-54, 56-59. 
collision test, 68-70, 72-73, 151. 
coupon collector’s test, 61-63, 74,
151, 167.

empirical tests, 59-75. 
equidistribution test, 59, 72. 
frequency test, 59, 72. 
gap test, 60-61, 72-73, 131, 151, 167. 
Kolmogorov-Smirnov test, 45-52, 54-58, 
59, 68. 

maximum-of-t test, 49, 51, 57, 68, 74, 
117, 151, 167. 
partition test, 62, 72, 151. 
permutation test, 64, 75, 76, 147. 
poker test, 62, 72, 151. 
run test, 61, 65-68, 72, 74, 88, 151, 167. 
serial correlation test, 70-72, 75, 78-85, 
148, 168. 

serial test, 36, 60, 72, 73, 85-88, 91, 
105-110, 113, 151. 

spectral test, 24, 29, 89-113, 170, 530. 
subsequence tests, 65, 151. 
theoretical tests, 75-113. 

TeX, vii, 689.

Thacher, Henry Clarke, Jr., 510. 
Theoretical tests for randomness, 75-113. 
Thiele, Thorvald Nicolai, 487. 

Thompson, John Eric Sidney, 180. 
Thomson, William Ettrick, 3, 10, 21. 
Thurber, Edward Gerrish, 451, 458, 459. 
Tienari, Martti, 265. 

Tingey, Fred Hollis, 55. 

Tippett, Leonard Henry Caleb, 2. 

Tobey, Robert George, 622. 

Tocher, Keith Douglas, 553. 

Toeplitz, Otto, 657. 

Tonal system, 185. 

Tonelli, Alberto, 626. 

Toom, A. L., 280, 282, 284, 290. 
Topological sorting, 461. 

Torres y Quevedo, Leonardo de, 209. 

Trabb Pardo, Luis Isidoro, 369, 611. 
Trager, Barry Marshall, 631.

Trailing digit, 179. 

Transcendental numbers, 363. 

Transpose, 488, 664. 

Traub, Joseph Frederick, 133, 335, 380,
410, 470, 479, 486, 497, 512-513, 514,
515, 656.

Trees: Branching information structures, 
binary, 363, 555, 630. 
enumeration of, 639. 
oriented, 444-446, 463, 530, 567. 
representation of, 463, 530, 555, 634. 

Trial quotient, 255-260, 263-264, 266-267. 
Triangularization of matrices, 425-427,
621, 625.

Trie, 630. 

Trilinear representation of tensors, 503. 
Triple-precision floating point, 237. 

Trits, 190. 

Tropfke, Johannes, 339. 

Truncation: Suppression of trailing digits, 
222, 293. 

Tsu Chhung-Chih, 181.

Tuckerman, Bryant, 391. 

Tukey, John Wilder, 642. 

Tung Yun Mei, 689.

Turán, Paul, 602.

Turing, Alan Mathison, 561.
machine, 164, 480.

Twindragon, 190, 564. 

Two’s complement notation, 14-15, 187, 
194, 197, 212, 261, 262. 

Ulam, Stanislaw Marcin, 135. 

Ulp, 217. 

Ultimately periodic sequences, 7-8, 21,
359, 369.

Underflow, exponent, 201, 203, 206, 211, 
216, 227-228, 233. 
gradual, 206. 

Ungar, Peter, 647. 

Uniform deviates, 9-37, 116-117, 170-173. 
Uniform distribution, 2, 9, 45, 47, 55, 114, 
116-120, 133, 248. 

Unique factorization domain, 403-405, 417. 
Unit in a unique factorization domain,
403, 417.

Unlimited precision, 265, 268, 314. 
Unnormalized floating point arithmetic, 
223-225, 227, 229, 310. 

Useful primes, 276, 390, 391, 614, 652. 
Uspensky, James Victor, 264. 

Valach, Miroslav, 276. 

Valiant, Leslie Gabriel, 480. 

Vallée Poussin, Charles Louis Xavier
Joseph de la, 366.

Valtat, Raymond, 186. 
van Ceulen, Ludolph, 182. 


Van de Wiele, Jean-Paul, 478, 648. 

van der Corput, Johannes Gualtherus, 157. 

van der Waerden, Bartel Leendert, 180,
415, 499, 632.

van Leeuwen, Jan, 497, 647. 
van Wijngaarden, Adriaan, 227. 

Vari, Thomas Michael, 642. 

Variance-ratio distribution, 130. 

Vaughan, Robert Charles, 433. 

Veltkamp, Gerhard W., 573. 

Vertex cover, 466. 

Vicente Gonsalves, Jose, 627. 

Vigesimal number system, 180. 

Viète, François, 182.

Ville, Jean, 560. 

Voltaire, François Marie Arouet de, 184.
von Fritz, Kurt, 318. 
von Mangoldt, Hans Carl Friedrich, 613. 
function, 355, 361. 

von Mises, Richard, edler, 142, 165, 475. 
von Neumann, John, 1, 3, 26, 114, 120, 124, 
135, 186, 210, 263, 310. 
von Schelling, Hermann, 63. 
von Schubert, Friedrich Theodor, 431. 

Wadel, Louis Burnett, 188. 

Wadey, Walter Geoffrey, 210, 227. 

Waiting time, 131. 

Wakulicz, Andrzej, 188, 268. 

Wald, Abraham, 157, 165. 

Wales, Francis Herbert, 178, 186. 

Walfisz, Arnold, 366. 

Walker, Alastair J., 115, 122, 134, 555. 
Walker, Andrew Morris, 71. 

Wall, Donald Dines, 527. 

Wall, Hubert Stanley, 339. 

Wallace, Christopher S., 299. 

Wallis, John, 182-183. 

Walsh, Joseph Leonard, 483. 

Wang, Paul Shyh-Horng, 436, 631. 

Ward, Morgan, 528. 

Waterman, Alan Gaisford, vii, 37, 104, 111, 
139, 529. 

Waterman, Michael Spencer, 608. 

Watson, Eric John, 30. 

Weather, 72. 

Wedge-shaped distributions, 120-121.
Weigel, Erhard, 183. 

Weighing problem, 192. 

Weights and measures, 182-185, 193, 310. 
Weinberger, Peter Jay, 380, 397, 632. 

Welch, Lloyd Richard, 430. 

Welch, Peter Dunbar, 642. 

Welford, B. P., 216. 

Westlake, Wilfred James, 31. 

Weyl, Hermann, 168, 366, 592. 

Wheeler, David John, 210. 

White, Jon L, 589. 

White sequence, 168. 
Whiteside, Derek Thomas, 467.

Wilf, Herbert Saul, 140. 

Wilkes, Maurice Vincent, 185, 210. 
Wilkinson, James Hardy, 226, 480. 
Williams, Hugh Cowie, 378, 397. 

Williams, John Hayden, 519. 

Williamson, Dorothy, 110. 

Winograd, Shmuel, vii, 265, 299, 481, 482, 
488, 490, 494, 495, 496, 502, 505, 641, 
646, 647, 654. 

Wirsing, Eduard, 347, 350, 361. 

WM1 (word size minus one), 15, 236, 253. 
Wolf, Thomas H., 176. 

Wolfowitz, Jacob, 67, 72. 